Introduction


Engineering disciplines are those fields of research and development that 
attempt to create products and systems operating in, and dealing with, 
the real world. The number of disciplines is large, as is the range of scales 
that they typically operate in: from the very small scale of nanotechnol- 
ogy up to very large scales that span whole regions, e.g. water manage- 
ment systems, electric power distribution systems, or even global systems 
(e.g. the global positioning system, GPS). The level of advancement in 
the fields also varies wildly, from emerging techniques (again, nanotech- 
nology) to trusted techniques that have been applied for centuries (archi- 
tecture, hydraulic works). Nonetheless, the disciplines share one 
important aspect: engineering aims at designing and manufacturing 
systems that interface with the world around them. 

Systems designed by engineers are often meant to influence their 
environment: to manipulate it, to move it, to stabilize it, to please it, 
and so on. To enable such actuation, these systems need information, 
e.g. values of physical quantities describing their environments and 
possibly also describing themselves. Two types of information sources 
are available: prior knowledge and empirical knowledge. The latter is 
knowledge obtained by sensorial observation. Prior knowledge is the 
knowledge that was already there before a given observation became 
available (this does not imply that prior knowledge is obtained without 
any observation). The combination of prior knowledge and empirical 
knowledge leads to posterior knowledge. 

The sensory subsystem of a system produces measurement signals.
These signals carry the empirical knowledge. Often, the direct usage 
of these signals is not possible, or inefficient. This can have several 
causes: 


• The information in the signals is not represented in an explicit way.
  It is often hidden and only available in an indirect, encoded form.

• Measurement signals always come with noise and other
  hard-to-predict disturbances.

• The information brought forth by posterior knowledge is more
  accurate and more complete than information brought forth by
  empirical knowledge alone. Hence, measurement signals should
  be used in combination with prior knowledge.


Measurement signals need processing in order to suppress the noise and 
to disclose the information required for the task at hand. 


1.1 THE SCOPE OF THE BOOK 


In a sense, classification and estimation deal with the same pro- 
blem: given the measurement signals from the environment, how 
can the information that is needed for a system to operate in the 
real world be inferred? In other words, how should the measure- 
ments from a sensory system be processed in order to bring max- 
imal information in an explicit and usable form? This is the main 
topic of this book. 

Good processing of the measurement signals is possible only if 
some knowledge and understanding of the environment and the 
sensory system is present. Modelling certain aspects of that environ- 
ment — like objects, physical processes or events — is a necessary task 
for the engineer. However, straightforward modelling is not always 
possible. Although the physical sciences provide ever deeper insight 
into nature, some systems are still only partially understood; just 
think of the weather. But even if systems are well understood, 
modelling them exhaustively may be beyond our current capabilities 
(i.e. computer power) or beyond the scope of the application. In such 
cases, approximate general models, but adapted to the system at 
hand, can be applied. The development of such models is also a 
topic of this book. 


1.1.1 Classification


The title of the book already indicates the three main subtopics it will cover: 
classification, parameter estimation and state estimation. In classification, 
one tries to assign a class label to an object, a physical process, or an event. 
Figure 1.1 illustrates the concept. In a speeding detector, the sensors are 
a radar speed detector and a high-resolution camera, placed in a box beside 
a road. When the radar detects a car approaching at too high a velocity 
(a parameter estimation problem), the camera is signalled to acquire an 
image of the car. The system should then recognize the license plate, so that 
the driver of the car can be fined for the speeding violation. The system 
should be robust to differences in car model, illumination, weather
circumstances, etc., so some pre-processing is necessary: locating the license
plate in the image, segmenting the individual characters and converting each
of them into a binary image. The problem then breaks down into a number of
individual classification problems. For each of the locations on the license
plate, the input consists of a binary image of a character, normalized for
size, skew/rotation and intensity. The desired output is the label of the true
character, i.e. one of ‘A’, ‘B’, ..., ‘Z’, ‘0’, ..., ‘9’.

Detection is a special case of classification. Here, only two class labels 
are available, e.g. ‘yes’ and ‘no’. An example is a quality control system 
that approves the products of a manufacturer, or refuses them. A second 
problem closely related to classification is identification: the act of
proving that an object-under-test and a second object that has been seen
previously are the same. Usually, there is a large database of previously seen
objects to choose from. An example is biometric identification, e.g.
fingerprint recognition or face recognition. A third problem that can be
solved by classification-like techniques is retrieval from a database, e.g.
finding an image in an image database by specifying image features.

Figure 1.1 License plate recognition: a classification problem with noisy measurements


1.1.2 Parameter estimation 


In parameter estimation, one tries to derive a parametric description for 
an object, a physical process, or an event. For example, in a beacon- 
based position measurement system (Figure 1.2), the goal is to find the 
position of an object, e.g. a ship or a mobile robot. In the two- 
dimensional case, two beacons with known reference positions suffice. 
The sensory system provides two measurements: the distances from the 
beacons to the object, r1 and r2. Since the position of the object involves
two parameters, the estimation seems to boil down to solving two 
equations with two unknowns. However, the situation is more complex 
because measurements always come with uncertainties. Usually, the 
application not only requires an estimate of the parameters, but also 
an assessment of the uncertainty of that estimate. The situation is even 
more complicated because some prior knowledge about the position 
must be used to resolve the ambiguity of the solution. The prior know- 
ledge can also be used to reduce the uncertainty of the final estimate. 
In order to improve the accuracy of the estimate, the engineer can
increase the number of (independent) measurements to obtain an
overdetermined system of equations. In order to reduce the cost of the
sensory system, the engineer can also decrease the number of measurements,
leaving fewer measurements than parameters. The system of equations is then
underdetermined, but estimation is still possible if
enough prior knowledge exists, or if the parameters are related to each
other (possibly in a statistical sense). In either case, the engineer is
interested in the uncertainty of the estimate.

Figure 1.2 Position measurement: a parameter estimation problem handling
uncertainties (figure labels: beacon 1, beacon 2, object, prior knowledge)
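
The two-beacon case can be made concrete with a small Python sketch. It assumes noise-free range measurements and made-up beacon positions, and all function and variable names are hypothetical. Intersecting the two range circles yields two mirror-image candidate positions; a rough prior guess of the position is then used to resolve the ambiguity.

```python
import math

def range_candidates(b1, b2, r1, r2):
    """Return the two points at distance r1 from beacon b1 and r2 from beacon b2."""
    dx, dy = b2[0] - b1[0], b2[1] - b1[1]
    d = math.hypot(dx, dy)                    # distance between the two beacons
    a = (r1**2 - r2**2 + d**2) / (2 * d)      # offset from b1 along the baseline
    h2 = r1**2 - a**2                         # squared offset perpendicular to the baseline
    if h2 < 0:
        # With noisy ranges the circles may not intersect at all.
        raise ValueError("range circles do not intersect")
    h = math.sqrt(h2)
    fx, fy = b1[0] + a * dx / d, b1[1] + a * dy / d   # foot point on the baseline
    return [(fx - h * dy / d, fy + h * dx / d),
            (fx + h * dy / d, fy - h * dx / d)]

def pick_by_prior(candidates, prior_guess):
    """Resolve the ambiguity by choosing the candidate closest to the prior guess."""
    return min(candidates, key=lambda p: math.dist(p, prior_guess))

# Illustrative numbers only.
beacon1, beacon2 = (0.0, 0.0), (10.0, 0.0)
true_position = (4.0, 3.0)
r1 = math.dist(beacon1, true_position)
r2 = math.dist(beacon2, true_position)

candidates = range_candidates(beacon1, beacon2, r1, r2)
estimate = pick_by_prior(candidates, prior_guess=(5.0, 1.0))
print(candidates, estimate)
```

With noisy ranges the two circles need not intersect, which is one reason why a probabilistic treatment that also quantifies the uncertainty of the estimate is preferable to solving the equations directly.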


1.1.3 State estimation 


In state estimation, one tries to do either of the following: assign a class
label or derive a parametric (real-valued) description, but for processes
that vary in time or space. There is a fundamental
difference between the problems of classification and parameter estima- 
tion on the one hand, and state estimation on the other hand. This is the 
ordering in time (or space) in state estimation, which is absent from 
classification and parameter estimation. When no ordering in the data is 
assumed, the data can be processed in any order. In time series, ordering 
in time is essential for the process. This results in a fundamental differ- 
ence in the treatment of the data. 
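
The role of ordering can be illustrated with a minimal Python sketch. The data and the names are made up, and the first-order recursive tracker used here is only a simple stand-in for a state estimator, not a method developed in this book. The sample mean ignores the order of the measurements, whereas the recursive estimate depends on it.

```python
def sample_mean(zs):
    """Order-independent estimate: any permutation of zs gives the same result."""
    return sum(zs) / len(zs)

def recursive_tracker(zs, gain=0.5, x0=0.0):
    """Order-dependent estimate: each new measurement pulls the estimate towards it."""
    x = x0
    for z in zs:
        x = x + gain * (z - x)
    return x

measurements = [1.0, 2.0, 3.0, 4.0, 5.0]    # e.g. a level that rises over time
reordered = measurements[::-1]              # same values, reversed order

mean_ordered = sample_mean(measurements)
mean_reordered = sample_mean(reordered)
track_ordered = recursive_tracker(measurements)
track_reordered = recursive_tracker(reordered)

print(mean_ordered, mean_reordered)      # identical
print(track_ordered, track_reordered)    # different
```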

In the discrete case, the states have discrete values (classes or labels) 
that are usually drawn from a finite set. An example of such a set is the 
alarm stages in a safety system (e.g. ‘safe’, ‘pre-alarm’, ‘red alert’, etc.). 
Other examples of discrete state estimation are speech recognition, 
printed or handwritten text recognition and the recognition of the 
operating modes of a machine. 

An example of real-valued state estimation is the water management 
system of a region. Using a few level sensors, and an adequate dynamical 
model of the water system, a state estimator is able to assess the water 
levels even at locations without level sensors. Short-term prediction of 
the levels is also possible. Figure 1.3 gives a view of a simple water 
management system of a single canal consisting of three linearly con- 
nected compartments. The compartments are filled by the precipitation 
in the surroundings of the canal. This occurs randomly but with a 
seasonal influence. The canal drains its water into a river. The measure- 
ment of the level in one compartment enables the estimation of the levels 
in all three compartments. For that, a dynamic model is used that 
describes the relations between flows and levels. Figure 1.3 shows an 
estimate of the level of the third compartment using measurements of the 
level in the first compartment. Prediction of the level in the third com- 
partment is possible due to the causality of the process and the delay 
between the levels in the compartments. 

Figure 1.3 Assessment of water levels in a water management system: a state
estimation problem (the data is obtained from a scale model) 


1.1.4 Relations between the subjects 


The reader who is familiar with one or more of the three subjects might 
wonder why they are treated in one book. The three subjects share the 
following factors: 


• In all cases, the engineer designs an instrument, i.e. a system whose
  task is to extract information about a real-world object, a physical
  process or an event.

• For that purpose, the instrument will be provided with a sensory
  subsystem that produces measurement signals. In all cases, these signals
  are represented by vectors (with fixed dimension) or sequences of vectors.

• The measurement vectors must be processed to reveal the information
  that is required for the task at hand.

• All three subjects rely on the availability of models describing the
  object/physical process/event, and of models describing the sensory system.

• Modelling is an important part of the design stage. The suitability
  of the applied model is directly related to the performance of the
  resulting classifier/estimator.

Since the nature of the questions raised in the three subjects is similar, the
analysis of all three cases can be done using the same framework. This allows 
an economical treatment of the subjects. The framework that will be used is 
a probabilistic one. In all three cases, the strategy will be to formulate the 
posterior knowledge in terms of a conditional probability (density) function: 


P(quantities of interest|measurements available) 


This so-called posterior probability combines the prior knowledge with 
the empirical knowledge by using Bayes’ theorem for conditional prob- 
abilities. As discussed above, the framework is generic for all three cases. 
Of course, the elaboration of this principle for the three cases leads to 
different solutions, because the natures of the ‘quantities of interest’ 
differ. 
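
For a finite set of classes, the combination of prior and empirical knowledge through Bayes' theorem can be sketched in a few lines of Python. The quality control scenario, the numbers and the names below are illustrative assumptions only.

```python
def posterior(prior, likelihood):
    """Bayes' theorem for a finite set of classes.

    prior:      dict mapping class -> P(class)
    likelihood: dict mapping class -> P(observed measurement | class)
    """
    joint = {c: prior[c] * likelihood[c] for c in prior}
    evidence = sum(joint.values())        # P(observed measurement)
    return {c: joint[c] / evidence for c in joint}

# Hypothetical quality control system: 1% of the products are defective, and the
# sensor raises a flag for 95% of defective and 5% of good products.
prior = {"defect": 0.01, "ok": 0.99}
likelihood_flag = {"defect": 0.95, "ok": 0.05}

post = posterior(prior, likelihood_flag)
print(post)
```

Even though the flag is fairly reliable, the posterior probability of a defect stays well below one half in this sketch, because the prior probability of a defect is small.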

The second similarity between the topics is their reliance on models. 
It is assumed that the constitution of the object/physical process/event 
(including the sensory system) can be captured by a mathematical model. 
Unfortunately, the physical structures responsible for generating the 
objects/processes/events are often unknown, or at least partly unknown.
Consequently, the model is also, at least partly, unknown. Sometimes, some
functional form of the model is assumed, but the free parameters still 
have to be determined. In any case, empirical data is needed in order to 
establish the model, to tune the classifier/estimator-under-development, 
and also to evaluate the design. Obviously, the training/evaluation data 
should be obtained from the process we are interested in. 

In fact, all three subjects share the same key issue related to modelling, 
namely the selection of the appropriate generalization level. The empirical 
data is only an example of a set of possible measurements. If too much
weight is given to the data at hand, there is a risk of overfitting: the
resulting model will depend too much on the accidental peculiarities (or
noise) of the data. On the other hand, if too little weight is given, nothing will
be learned and the model relies completely on the prior knowledge. The right
balance between these two extremes depends on the statistical significance
of the data. Obviously, the size of the data set is an important factor. However,
the statistical significance also depends on the dimensionality of the data.
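
One concrete instance of this balance is the estimation of a scalar mean under the assumption (made here purely for illustration) of a Gaussian prior and Gaussian measurement noise with known variances. The posterior mean is then a weighted average of the sample mean and the prior mean, and the weight given to the data grows with the number of measurements. The names and numbers in this Python sketch are hypothetical.

```python
def posterior_mean(data, prior_mean, prior_var, noise_var):
    """Posterior mean of a scalar with a Gaussian prior and Gaussian noise.

    Returns the estimate and the weight w (between 0 and 1) given to the data.
    """
    n = len(data)
    mean_of_data = sum(data) / n
    w = n / (n + noise_var / prior_var)
    return w * mean_of_data + (1.0 - w) * prior_mean, w

few = [2.0, 4.0]             # two measurements
many = [2.0, 4.0] * 50       # one hundred measurements with the same sample mean

est_few, w_few = posterior_mean(few, prior_mean=0.0, prior_var=1.0, noise_var=4.0)
est_many, w_many = posterior_mean(many, prior_mean=0.0, prior_var=1.0, noise_var=4.0)
print(est_few, w_few)
print(est_many, w_many)
```

With only two measurements the estimate stays close to the prior mean; with one hundred measurements it moves close to the sample mean.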

Many of the mathematical techniques for modelling, tuning, training 
and evaluation can be shared between the three subjects. Estimation 
procedures used in classification can also be used in parameter estima- 
tion or state estimation with just minor modifications. For instance, 
probability density estimation can be used for classification purposes, 
and also for estimation. Data-fitting techniques are applied in both 
classification and estimation problems. Techniques for statistical
inference can also be shared. Of course, there are also differences between the
three subjects. For instance, the modelling of dynamic systems, usually 
called system identification, involves aspects that are typical for dynamic 
systems (i.e. determination of the order of the system, finding an appro- 
priate functional structure of the model). However, when it finally 
comes to finding the right parameters of the dynamic model, the tech- 
niques from parameter estimation apply again. 

Figure 1.4 shows an overview of the relations between the topics. 
Classification and parameter estimation share a common foundation 
indicated by ‘Bayes’. In combination with models for dynamic systems 
(with random inputs), the techniques for classification and parameter 
estimation find their application in processes that proceed in time, i.e. 
state estimation. All this is built on a mathematical basis with selected 
topics from mathematical analysis (dealing with abstract vector spaces, 
metric spaces and operators), linear algebra and probability theory. 
As such, classification and estimation are not tied to a specific application. 
The engineer, who is involved in a specific application, should add the 
individual characteristics of that application by means of the models and 
prior knowledge. Thus, apart from the ability to handle empirical data, 
the engineer must also have some knowledge of the physical background 
related to the application at hand and to the sensor technology being used. 

Figure 1.4 Relations between the subjects

All three subjects are mature research areas, and many overview
books have been written. Naturally, by combining the three subjects 
into one book, it cannot be avoided that some details are left out. 
However, the discussion above shows that the three subjects are close 
enough to justify one integrated book, covering these areas. 

The combination of the three topics into one book also introduces 
some additional challenges if only because of the differences in termin- 
ology used in the three fields. This is, for instance, reflected in the 
difference in the term used for ‘measurements’. In classification theory, 
the term ‘features’ is frequently used as a replacement for ‘measure- 
ments’. The number of measurements is called the ‘dimension’, but in 
classification theory the term ‘dimensionality’ is often used.¹ The same
remark holds true for notations. For instance, in classification theory the 
measurements are often denoted by x. In state estimation, two notations 
are in vogue: either y or z (MATLAB uses y, but we chose z). In all cases 
we tried to be as consistent as possible. 


1.2 ENGINEERING 


The top-down design of an instrument always starts with some primary 
need. Before starting with the design, the engineer has only a global view of 
the system of interest. The actual need is known only at a high and abstract 
level. The design process then proceeds through a number of stages during 
which progressively more detailed knowledge becomes available, and the 
system parts of the instrument are described at lower and more concrete 
levels. At each stage, the engineer has to make design decisions. Such 
decisions must be based on explicitly defined evaluation criteria. The 
procedure, the elementary design step, is shown in Figure 1.5. It is used 
iteratively at the different levels and for the different system parts. 

An elementary design step typically consists of collecting and organiz- 
ing knowledge about the design issue of that stage, followed by an 
explicit formulation of the involved task. The next step is to associate
the design issue with an evaluation criterion. The criterion expresses the
suitability of a design concept related to the given task, but other
aspects can also be involved, such as cost of manufacturing, computational
cost or throughput. Usually, there are a number of possible design
concepts to select from. Each concept is subjected to an analysis and an
evaluation, possibly based on some experimentation. Next, the engineer
decides which design concept is most appropriate. If none of the possible
concepts is acceptable, the designer steps back to an earlier stage to
alter the selections that have been made there.

Figure 1.5 An elementary step in the design process (Finkelstein and Finkelstein,
1994)

¹ Our definition complies with the mathematical definition of ‘dimension’, i.e. the maximal
number of independent vectors in a vector space. In MATLAB the term ‘dimension’ refers to an
index of a multidimensional array as in phrases like: ‘the first dimension of a matrix is the row
index’, and ‘the number of dimensions of a matrix is two’. The number of elements along a row
is the ‘row dimension’ or ‘row length’. In MATLAB the term ‘dimensionality’ is the same as the
‘number of dimensions’.

One of the first tasks of the engineer is to identify the actual need that 
the instrument must fulfil. The outcome of this design step is a descrip- 
tion of the functionality, e.g. a list of preliminary specifications, operat- 
ing characteristics, environmental conditions, wishes with respect to user 
interface and exterior design. The next steps deal with the principles and 
methods that are appropriate to fulfil the needs, i.e. the internal func- 
tional structure of the instrument. At this level, the system under design 
is broken down into a number of functional components. Each com- 
ponent is considered as a subsystem whose input/output relations are 
mathematically defined. Questions related to the actual construction, 
realization of the functions, housing, etc., are later concerns. 

The functional structure of an instrument can be divided roughly into 
sensing, processing and outputting (displaying, recording). This book 
focuses entirely on the design steps related to processing. It provides: 

• Knowledge about various methods to fulfil the processing tasks of
  the instrument. This is needed in order to generate a number of
  different design concepts.

• Knowledge about how to evaluate the various methods. This is
  needed in order to select the best design concept.

• A tool for the experimental evaluation of the design concepts.


The book does not address the topic ‘sensor technology’. For this, many 
good textbooks already exist, for instance see Regtien et al. (2004) and 
Brignell and White (1996). Nevertheless, the sensory system does have a 
large impact on the required processing. For our purpose, it suffices to 
consider the sensory subsystem at an abstract functional level such that it 
can be described by a mathematical model. 


1.3 THE ORGANIZATION OF THE BOOK 


The first part of the book, containing Chapters 2, 3 and 4, considers each of 
the three topics — classification, parameter estimation and state estimation — 
at a theoretical level. Assuming that appropriate models of the objects, 
physical process or events, and of the sensory system are available, these 
three tasks are well defined and can be discussed rigorously. This facilitates 
the development of a mathematical theory for these topics. 

The second part of the book, Chapters 5 to 8, discusses all kinds of 
issues related to the deployment of the theory. As mentioned in Section 
1.1, a key issue is modelling. Empirical data should be combined with 
prior knowledge about the physical process underlying the problem at 
hand, and about the sensory system used. For classification problems, 
the empirical data is often represented by labelled training and evalua- 
tion sets, i.e. sets consisting of measurement vectors of objects together 
with the true classes to which these objects belong. Chapters 5 and 6 
discuss several methods to deal with these sets. Some of these techni- 
ques — probability density estimation, statistical inference, data fitting — 
are also applicable to modelling in parameter estimation. Chapter 7 is 
devoted to unlabelled training sets. The purpose is to find structures 
underlying these sets that explain the data in a statistical sense. This is 
useful for both classification and parameter estimation problems. The 
practical aspects related to state estimation are considered in Chapter 8. 
In the last chapter all the topics are applied in some fully worked out 
examples. Four appendices are added in order to refresh the required 
mathematical background knowledge. 


The subtitle of the book, ‘An Engineering Approach using MATLAB’, indi- 
cates that its focus is not just on the formal description of classification, 
parameter estimation and state estimation methods. It also aims to 
provide practical implementations of the given algorithms. These imple- 
mentations are given in MATLAB. MATLAB is a commercial software 
package for matrix manipulation. Over the past decade it has become 
the de facto standard for development and research in data-processing 
applications. MATLAB combines an easy-to-learn user interface with a 
simple, yet powerful language syntax, and a wealth of functions orga- 
nized in toolboxes. We use MATLAB as a vehicle for experimentation, 
the purpose of which is to find out which method is the most appro- 
priate for a given task. The final construction of the instrument can also 
be implemented by means of MATLAB, but this is not strictly necessary. 
In the end, when it comes to realization, the engineer may decide to 
transform his design of the functional structure from MATLAB to other 
platforms using, for instance, dedicated hardware, software in 
embedded systems or virtual instrumentation such as LabView. 

For classification we will make use of PRTools (described in Appendix E), 
a pattern recognition toolbox for MATLAB freely available for non-com- 
mercial use. MATLAB itself has many standard functions that are useful for 
parameter estimation and state estimation problems. These functions are 
scattered over a number of toolboxes. Appendix F gives a short overview of 
these toolboxes. The toolboxes are accompanied by clear and crisp
documentation, to which we refer for details of the functions.

Each chapter is followed by a few exercises on the theory provided. 
However, we believe that only working with the actual algorithms will 
provide the reader with the necessary insight to fully understand the 
matter. Therefore, a large number of small code examples are provided 
throughout the text. Furthermore, a number of data sets to experiment 
with are made available through the accompanying website. 


Detection and Classification


Pattern classification is the act of assigning a class label to an object, a 
physical process or an event. The assignment is always based on meas- 
urements that are obtained from that object (or process, or event). The 
measurements are made available by a sensory system. See Figure 2.1. 
Table 2.1 provides some examples of application fields in which classi- 
fication is the essential task. 

The definition of the set of relevant classes in a given application is in 
some cases given by the nature of the application, but in other cases the 
definition is not trivial. In the application ‘character reading for license 
plate recognition’, the choice of the classes does not need much discussion.
However, in the application ‘sorting tomatoes into “class A”, “class
B”, and “class C”’, the definition of the classes is open for discussion. In
such cases, the classes are defined by a generally agreed convention that
the object is qualified according to the values of some attributes of the
object, e.g. its size, shape and colour.

Figure 2.1 Pattern classification

Table 2.1 Some application fields of pattern classification

Application field             Possible measurements           Possible classes

Object classification
  Sorting electronic parts    Shape, colour                   ‘resistor’, ‘capacitor’,
                                                              ‘transistor’, ‘IC’
  Sorting mechanical parts    Shape                           ‘ring’, ‘nut’, ‘bolt’
  Reading characters          Shape                           ‘A’, ‘B’, ‘C’, ...

Mode estimation in a physical process
  Classifying manoeuvres      Tracked point features          ‘straight on’, ‘turning’
  of a vehicle                in an image sequence
  Fault diagnosis in a        Cylinder pressures,             ‘normal operation’, ‘defect
  combustion engine           temperature, vibrations,        fuel injector’, ‘defect air
                              acoustic emissions, crank       inlet valve’, ‘leaking
                              angle resolver, ...             exhaust valve’, ...

Event detection
  Burglar alarm               Infrared                        ‘alarm’, ‘no alarm’
  Food inspection             Shape, colour, temperature,     ‘OK’, ‘NOT OK’
                              mass, volume

The sensory system measures some physical properties of the object 
that, hopefully, are relevant for classification. This chapter is confined 
to the simple case where the measurements are static, i.e. time inde- 
pendent. Furthermore, we assume that for each object the number of 
measurements is fixed. Hence, per object the outcomes of the measure- 
ments can be stacked to form a single vector, the so-called measurement 
vector. The dimension of the vector equals the number of
measurements. The set of all possible values of the measurement vector
is the measurement space. For some authors the word ‘feature’ is very
close to ‘measurement’, but we will reserve that word for later use in 
Chapter 6. 

The sensory system must be designed so that the measurement vector 
conveys the information needed to classify all objects correctly. If this is 
the case, the measurement vectors from all objects behave according to 
some pattern. Ideally, the physical properties are chosen such that all 
objects from one class form a cluster in the measurement space without 
overlapping the clusters formed by other classes. 

Example 2.1 Classification of small mechanical parts

Many workrooms have a spare part box where small, obsolete 
mechanical parts such as bolts, rings, nuts and screws are kept. Often, 
it is difficult to find a particular part. We would like to have the parts 
sorted out. For automated sorting we have to classify the objects by 
measuring some properties of each individual object. Then, based on 
the measurements we decide to what class that object belongs. 

As an example, Figure 2.2(a) shows an image with rings, nuts, bolts 
and remaining parts, called scrap. These four types of objects will be 
classified by means of two types of shape measurements. The first 
type expresses to what extent the object is six-fold rotationally
symmetric. The second type of measurement is the eccentricity of the
object. The image-processing technique that is needed to obtain these 
measurements is a topic that is outside the scope of this book. 

The 2D measurement vector of an object can be depicted as a point 
in the 2D measurement space. Figure 2.2(b) shows the graph of the 
points of all objects. Since the objects in Figure 2.2(a) are already 
sorted manually, it is easy here to mark each point with a symbol that 
indicates the true class of the corresponding object. Such a graph is 
called a scatter diagram of the data set. 

The measure for six-fold rotational symmetry is suitable to discrim- 
inate between rings and nuts since rings and nuts have a similar shape 
except for the six-fold rotational symmetry of a nut. The measure for 
eccentricity is suitable to discriminate bolts from the nuts and the rings.
The shapes of scrap objects are difficult to predict. Therefore, their
measurements are scattered all over the space.

Figure 2.2 Classification of mechanical parts. (a) Image of various objects,
(b) Scatter diagram (horizontal axis: measure of six-fold rotational symmetry)

In this example the measurements are more or less clustered accord- 
ing to their true class. Therefore, a new object is likely to have 
measurements that are close to the cluster of the class to which the 
object belongs. Hence, the assignment of a class boils down to decid- 
ing to which cluster the measurements of the object belong. This can 
be done by dividing the 2D measurement space into four different
partitions, one for each class. A new object is classified according to
the partition in which its measurement vector lies.

Unfortunately, some clusters are in each other’s vicinity, or even 
overlapping. In these regions the choice of the partitioning is critical. 
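
As a first impression of what such a partitioning can look like, the following Python sketch assigns a new object to the class whose cluster mean is nearest. This nearest-mean rule is a deliberately simple stand-in, not the Bayesian classifier developed in this chapter. The measurement values are invented for illustration, and the class ‘scrap’ is left out because its scattered measurements are poorly described by a single mean.

```python
import math

# Hypothetical training data per class:
# (measure of six-fold rotational symmetry, measure of eccentricity).
training = {
    "ring": [(0.10, 0.05), (0.15, 0.10), (0.12, 0.08)],
    "nut":  [(0.80, 0.10), (0.85, 0.15), (0.90, 0.12)],
    "bolt": [(0.20, 0.80), (0.25, 0.90), (0.15, 0.85)],
}

def class_means(data):
    """Compute the mean measurement vector (cluster centre) of each class."""
    means = {}
    for label, points in data.items():
        n = len(points)
        means[label] = (sum(p[0] for p in points) / n,
                        sum(p[1] for p in points) / n)
    return means

def classify(z, means):
    """Assign z to the class with the nearest cluster centre."""
    return min(means, key=lambda label: math.dist(z, means[label]))

means = class_means(training)
label_a = classify((0.82, 0.11), means)
label_b = classify((0.18, 0.88), means)
print(label_a, label_b)
```

The rule implicitly divides the measurement space into one region per class, with straight boundaries halfway between the cluster centres. It takes no account of the spread of the clusters or of how often each class occurs, which is exactly where the probabilistic framework of this chapter comes in.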


This chapter addresses the problem of how to design a pattern classifier. 
This is done within a Bayesian-theoretic framework. Section 2.1 
discusses the general case. In Sections 2.1.1 and 2.1.2 two particular 
cases are dealt with. The so-called ‘reject option’ is introduced in Section 
2.2. Finally, the two-class case, often called ‘detection’, is covered by 
Section 2.3. 


2.1 BAYESIAN CLASSIFICATION 


Probability theory is a solid base for pattern classification design. In this 
approach the pattern-generating mechanism is represented within a 
probabilistic framework. Figure 2.3 shows such a framework. The
starting point is a stochastic experiment (Appendix C.1) defined by a set
\(\Omega = \{\omega_1, \ldots, \omega_K\}\) of K classes. We assume that the
classes are mutually exclusive. The probability \(P(\omega_k)\) of having a
class \(\omega_k\) is called the prior probability. It represents the
knowledge that we have about the class of an object before the
measurements of that object are available. Since the number of possible
classes is K, we have:

\[ \sum_{k=1}^{K} P(\omega_k) = 1 \tag{2.1} \]

Figure 2.3 Statistical pattern classification


The sensory system produces a measurement vector z with dimension N. 
Objects from different classes should have different measurement vec- 
tors. Unfortunately, the measurement vectors from objects within the 
same class also vary. For instance, the eccentricities of bolts in Figure 2.2 
are not fixed since the shape of bolts is not fixed. In addition, all 
measurements are subject to some degree of randomness due to all kinds 
of unpredictable phenomena in the sensory system, e.g. quantum noise, 
thermal noise, quantization noise. The variations and randomness are 
taken into account by the probability density function of z. 

The conditional probability density function of the measurement vector
z is denoted by \(p(z|\omega_k)\). It is the density of z coming from an
object with known class \(\omega_k\). If z comes from an object with
unknown class, its density is indicated by \(p(z)\). This density is the
unconditional density of z. Since classes are supposed to be mutually
exclusive, the unconditional density can be derived from the conditional
densities by weighting these densities by the prior probabilities:

\[ p(z) = \sum_{k=1}^{K} p(z|\omega_k) P(\omega_k) \tag{2.2} \]

The pattern classifier maps the measurement vector to the class that will
be assigned to the object. This is accomplished by the so-called decision
function \(\hat{\omega}(\cdot)\) that maps the measurement space onto the
set of possible classes. Since z is an N-dimensional vector, the function
maps \(\mathbb{R}^N\) onto \(\Omega\). That is:
\(\hat{\omega}(\cdot): \mathbb{R}^N \to \Omega\).


Example 2.2 Probability densities of the ‘mechanical parts’ data 

Figure 2.4 is a graphical representation of the probability densities of
the measurement data from Example 2.1. The unconditional density
\(p(z)\) is derived from (2.2) by assuming that the prior probabilities
\(P(\omega_k)\) are reflected in the frequencies of occurrence of each
type of object in Figure 2.2. In that figure, there are 94 objects with
frequencies bolt:nut:ring:scrap = 20:28:27:19. Hence the corresponding
prior probabilities are assumed to be 20/94, 28/94, 27/94 and 19/94,
respectively.

The probability densities shown in Figure 2.4 are in fact not the
real densities, but estimates obtained from the samples. The
topic of density estimation will be dealt with in Chapter 5. PRTools
code to plot 2D contours and 3D meshes of a density is given in
Listing 2.1.

Figure 2.4 Probability densities of the measurements shown in Figure 2.2. (a) The
3D plot of the unconditional density together with a 2D contour plot of this density
on the ground plane. (b) 2D contour plots of the conditional probability densities


Listing 2.1
PRTools code for creating density plots.

    load nutsbolts;                     % Load the dataset; see Listing 5.1
    w = gaussm(z,1);                    % Estimate a mixture of Gaussians
    figure(1); scatterd(z); hold on;
    plotm(w,6,[0.1 0.5 1.0]);           % Plot in 3D
    figure(2); scatterd(z); hold on;
    for c = 1:4
      w = gaussm(seldat(z,c),1);        % Estimate a Gaussian per class
      plotm(w,2,[0.1 0.5 1.0]);         % Plot in 2D
    end;

In some cases, the measurement vectors coming from objects with
different classes show some overlap in the measurement space. Therefore, it
cannot always be guaranteed that the classification is free from mistakes.
An erroneous assignment of a class to an object causes some damage, or
some loss of value of the misclassified object, or an impairment of its
usefulness. All this depends on the application at hand. For instance, in the
application ‘sorting tomatoes into classes A, B or C’, a class B
tomato misclassified as ‘class C’ causes a loss of value because a
‘class B’ tomato yields more profit than a ‘class C’ tomato. On the other
hand, if a class C tomato is misclassified as a ‘class B’ tomato, the damage
is much greater since such a situation may lead to dissatisfied customers.

A Bayes classifier is a pattern classifier that is based on the following 
two prerequisites: 


• The damage, or loss of value, involved when an object is erroneously
  classified can be quantified as a cost.
• The expectation of the cost is acceptable as an optimization criterion.


If the application at hand meets these two conditions, then the develop- 
ment of an optimal pattern classification is theoretically straightforward. 
However, the Bayes classifier needs good estimates of the densities of the 
classes. These estimates can be problematic to obtain in practice. 

The damage, or loss of value, is quantified by a cost function (or loss
function) \(C(\hat{\omega}|\omega_k)\). The function
\(C(\cdot|\cdot): \Omega \times \Omega \to \mathbb{R}\) expresses the cost
that is involved when the class assigned to an object is \(\hat{\omega}\)
while the true class of that object is \(\omega_k\). Since there are K
classes, the function \(C(\hat{\omega}_i|\omega_k)\) is fully specified
by a K × K matrix. Therefore, the cost function is sometimes called a cost
matrix. In some applications, the cost might be negative, expressing
the fact that the assignment of that class pays off (negative cost = profit).


Example 2.3 Cost function of the mechanical parts application 

In fact, automated sorting of the parts in a ‘bolts-and-nuts’ box is an
example of a recycling application. If we were not collecting the
mechanical parts for reuse, these parts would be disposed of. Therefore,
a correct classification of a part saves the cost of a new part, and
thus the cost of such a classification is negative. However, we have to
take into account that:

• The effort of classifying and sorting a part also has to be paid. This
  cost is the same for all parts, regardless of their class and whether
  they have been classified correctly or not.
• A bolt that has been erroneously classified as a nut or a ring causes
  more trouble than a bolt that has been erroneously classified as
  scrap. Similar arguments hold for a nut and a ring.

Table 2.2 is an example of a cost function that might be appropriate
for this application.


The concepts introduced above, i.e. prior probabilities, conditional
densities and cost function, are sufficient to design optimal classifiers.
However, first another probability has to be derived: the posterior
probability \(P(\omega_k|z)\). It is the probability that an object belongs
to class \(\omega_k\) given that the measurement vector associated with
that object is z. According to Bayes’ theorem for conditional probabilities
(Appendix C.2) we have:

\[ P(\omega_k|z) = \frac{p(z|\omega_k)\,P(\omega_k)}{p(z)} \tag{2.3} \]


If an arbitrary classifier assigns a class \(\hat{\omega}_i\) to a
measurement vector z coming from an object with true class \(\omega_k\),
then a cost \(C(\hat{\omega}_i|\omega_k)\) is involved. The posterior
probability of having such an object is \(P(\omega_k|z)\). Therefore, the
expectation of the cost is:

\[ R(\hat{\omega}_i|z) = \mathrm{E}[C(\hat{\omega}_i|\omega_k)\,|\,z] = \sum_{k=1}^{K} C(\hat{\omega}_i|\omega_k)\,P(\omega_k|z) \tag{2.4} \]

This quantity is called the conditional risk. It expresses the expected cost
of the assignment \(\hat{\omega}_i\) to an object whose measurement vector
is z. From (2.4) it follows that the conditional risk of a decision function
\(\hat{\omega}(z)\) is \(R(\hat{\omega}(z)|z)\). The overall risk can be
found by averaging the conditional risk over all possible measurement
vectors:

\[ R = \mathrm{E}[R(\hat{\omega}(z)|z)] = \int_{z} R(\hat{\omega}(z)|z)\,p(z)\,dz \tag{2.5} \]

The integral extends over the entire measurement space. The quantity R
is the overall risk (average risk, or briefly, risk) associated with the
decision function \(\hat{\omega}(z)\). The overall risk is important for
cost price calculations of a product.

Table 2.2 Cost function of the ‘sorting mechanical part’ application

C(ω̂_i|ω_k) in $                    True class
Assigned class    ω_1 = bolt   ω_2 = nut   ω_3 = ring   ω_4 = scrap
ω̂ = bolt            −0.20        0.07        0.07         0.07
ω̂ = nut              0.07       −0.15        0.07         0.07
ω̂ = ring             0.07        0.07       −0.05         0.07
ω̂ = scrap            0.03        0.03        0.03         0.03

The second prerequisite mentioned above states that the optimal
classifier is the one with minimal risk R. The decision function that
minimizes the (overall) risk is the same as the one that minimizes the
conditional risk. Therefore, the Bayes classifier takes the form:


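\[ \hat{\omega}_{BAYES}(z) = \hat{\omega}_i \quad\text{such that:}\quad R(\hat{\omega}_i|z) \le R(\hat{\omega}_j|z) \qquad i,j = 1,\ldots,K \tag{2.6} \]

This can be expressed more briefly by:

\[ \hat{\omega}_{BAYES}(z) = \operatorname*{argmin}_{\omega\in\Omega} \{R(\omega|z)\} \tag{2.7} \]

The expression argmin{ } gives the element from \(\Omega\) that minimizes
\(R(\omega|z)\). Substitution of (2.3) and (2.4) yields:

\[ \begin{aligned}
\hat{\omega}_{BAYES}(z) &= \operatorname*{argmin}_{\omega\in\Omega} \left\{ \sum_{k=1}^{K} C(\omega|\omega_k)\,P(\omega_k|z) \right\} \\
&= \operatorname*{argmin}_{\omega\in\Omega} \left\{ \sum_{k=1}^{K} C(\omega|\omega_k)\,\frac{p(z|\omega_k)\,P(\omega_k)}{p(z)} \right\} \\
&= \operatorname*{argmin}_{\omega\in\Omega} \left\{ \sum_{k=1}^{K} C(\omega|\omega_k)\,p(z|\omega_k)\,P(\omega_k) \right\}
\end{aligned} \tag{2.8} \]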


Pattern classification according to (2.8) is called Bayesian classification 
or minimum risk classification. 
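To make (2.4) and (2.8) concrete, the sketch below evaluates the conditional risk of every possible assignment for a single measurement and picks the assignment with minimum risk. The cost matrix is taken from Table 2.2; the posterior probabilities in `post` are made-up numbers chosen for illustration, not values computed from the nutsbolts data.

```matlab
% Cost matrix of Table 2.2: rows = assigned class, columns = true class
% (order: bolt, nut, ring, scrap)
cost = [ -0.20  0.07  0.07  0.07 ;
          0.07 -0.15  0.07  0.07 ;
          0.07  0.07 -0.05  0.07 ;
          0.03  0.03  0.03  0.03 ];

% Hypothetical posterior probabilities P(w_k|z) for one measurement z
post = [0.05; 0.35; 0.55; 0.05];

risk = cost * post;                 % conditional risk R(w_i|z), cf. (2.4)
[minrisk, ibayes] = min(risk);      % minimum risk assignment, cf. (2.8)
[maxpost, imap]   = max(post);      % largest posterior, for comparison

names = {'bolt', 'nut', 'ring', 'scrap'};
fprintf('Minimum risk: %s, largest posterior: %s\n', ...
        names{ibayes}, names{imap});
```

With these numbers the ring has the largest posterior probability, yet the minimum risk assignment is the nut, because a correctly sorted nut saves more than a correctly sorted ring.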


Example 2.4 Bayes classifier for the mechanical parts application 
Figure 2.5(a) shows the decision boundary of the Bayes classifier
for the application discussed in the previous examples. Figure
2.5(b) shows the decision boundary that is obtained if the prior
probability of scrap is increased to 0.50, with the prior probabilities
of the other classes decreased evenly. Comparing the results,
it can be seen that such an increase enlarges the compartment
for the scrap at the expense of the other compartments.

Figure 2.5 Bayes classification. (a) With prior probabilities: P(bolt) = 0.21,
P(nut) = 0.30, P(ring) = 0.29, and P(scrap) = 0.20. (b) With increased prior
probability for scrap: P(scrap) = 0.50. (c) With uniform cost function

The overall risk associated with the decision function in Figure 2.5(a)
appears to be −$0.092; the one in Figure 2.5(b) is −$0.036. The
increase of cost (= decrease of profit) is due to the fact that scrap is
unprofitable. Hence, if the majority of a bunch of objects consists of
worthless scrap, recycling pays off less.

The total cost of all classified objects as given in Figure 2.5(a)
appears to be −$8.98. Since the figure shows 94 objects, the average
cost is −$8.98/94 = −$0.096. As expected, this comes close to the
overall risk.

Listing 2.2
PRTools code for estimating decision boundaries taking account of the
cost.

    load nutsbolts;
    cost = [ -0.20  0.07  0.07  0.07 ;
              0.07 -0.15  0.07  0.07 ;
              0.07  0.07 -0.05  0.07 ;
              0.03  0.03  0.03  0.03 ];
    w1 = qdc(z);                      % Estimate a single Gaussian per class
                                      % Change output according to cost
    w2 = w1*classc*costm([],cost);
    scatterd(z);
    plotc(w1);                        % Plot without using cost
    plotc(w2);                        % Plot using cost


2.1.1 Uniform cost function and minimum error rate 


A uniform cost function is obtained if a unit cost is assumed when an 
object is misclassified, and zero cost when the classification is correct. 
This can be written as:

\[ C(\hat{\omega}_i|\omega_k) = 1 - \delta(i,k) \quad\text{with:}\quad \delta(i,k) = \begin{cases} 1 & \text{if } i = k \\ 0 & \text{elsewhere} \end{cases} \tag{2.9} \]

\(\delta(i,k)\) is the Kronecker delta function. With this cost function the
conditional risk given in (2.4) simplifies to:

\[ R(\hat{\omega}_i|z) = \sum_{k=1,\,k\ne i}^{K} P(\omega_k|z) = 1 - P(\hat{\omega}_i|z) \tag{2.10} \]

Minimization of this risk is equivalent to maximization of the posterior
probability \(P(\hat{\omega}_i|z)\). Therefore, with a uniform cost
function, the Bayes decision function (2.8) becomes the maximum a
posteriori probability classifier (MAP classifier):

\[ \hat{\omega}_{MAP}(z) = \operatorname*{argmax}_{\omega\in\Omega} \{P(\omega|z)\} \tag{2.11} \]


Application of Bayes’ theorem for conditional probabilities and
cancellation of irrelevant terms yield a classification, equivalent to a MAP
classification, but fully in terms of the prior probabilities and the
conditional probability densities:

\[ \hat{\omega}_{MAP}(z) = \operatorname*{argmax}_{\omega\in\Omega} \{p(z|\omega)\,P(\omega)\} \tag{2.12} \]


The functional structure of this decision function is given in Figure 2.6. 

Suppose that a class \(\hat{\omega}_i\) is assigned to an object with
measurement vector z. The probability of having a correct classification is
\(P(\hat{\omega}_i|z)\). Consequently, the probability of having a
classification error is \(1 - P(\hat{\omega}_i|z)\). For an arbitrary
decision function \(\hat{\omega}(z)\), the conditional error probability is:

\[ e(z) = 1 - P(\hat{\omega}(z)|z) \tag{2.13} \]

It is the probability of an erroneous classification of an object whose
measurement is z. The error probability averaged over all objects can be
found by averaging e(z) over all the possible measurement vectors:

\[ E = \mathrm{E}[e(z)] = \int_{z} e(z)\,p(z)\,dz \tag{2.14} \]


The integral extends over the entire measurement space. E is called the
error rate, and is often used as a performance measure of a classifier.
The classifier that yields the minimum error rate among all other
classifiers is called the minimum error rate classifier. With a uniform
cost function, the risk and the error rate are equal. Therefore, the
minimum error rate classifier is a Bayes classifier with uniform cost
function. With our earlier definition of MAP classification we come to
the following conclusion:

Minimum error rate classification ≡ Bayes classification with unit cost function ≡ MAP classification

Figure 2.6 Bayes decision function with uniform cost function (MAP classification)

The conditional error probability of a MAP classifier is found by
substitution of (2.11) in (2.13):

\[ e_{min}(z) = 1 - \max_{\omega\in\Omega} \{P(\omega|z)\} \tag{2.15} \]

The minimum error rate \(E_{min}\) follows from (2.14):

\[ E_{min} = \int_{z} e_{min}(z)\,p(z)\,dz \tag{2.16} \]


Of course, phrases like ‘minimum’ and ‘optimal’ are strictly tied to the 
given sensory system. The performance of an optimal classification with 
a given sensory system may be less than the performance of a non- 
optimal classification with another sensory system. 
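The quantities in (2.12) to (2.16) can be checked numerically for a simple case. The following sketch, in plain MATLAB, assumes a hypothetical two-class problem with one-dimensional Gaussian measurements (means 0 and 2, unit variance, equal priors); none of these numbers come from the mechanical parts data. For this symmetric case the minimum error rate is known in closed form, which gives a check on the numerical integration.

```matlab
% Hypothetical two-class problem with 1D Gaussian conditional densities
prior = [0.5 0.5];                  % P(w_1), P(w_2) (assumed)
mu    = [0 2];                      % class means (assumed)
sigma = 1;                          % common standard deviation (assumed)
gausspdf = @(z,m,s) exp(-0.5*((z-m)/s).^2) / (s*sqrt(2*pi));

z  = linspace(-8, 10, 18001);       % grid over the measurement space
j1 = prior(1) * gausspdf(z, mu(1), sigma);   % p(z|w_1)P(w_1)
j2 = prior(2) * gausspdf(z, mu(2), sigma);   % p(z|w_2)P(w_2)
pz = j1 + j2;                       % unconditional density, cf. (2.2)

decision = 1 + (j2 > j1);           % MAP decision per grid point, cf. (2.12)
emin = 1 - max(j1, j2) ./ pz;       % minimal conditional error, cf. (2.15)
Emin = trapz(z, emin .* pz);        % minimum error rate, cf. (2.16)

% Closed-form value for this symmetric case: Phi(-1), written with erfc
Eref = 0.5 * erfc(1/sqrt(2));
```

The decision switches from class 1 to class 2 halfway between the two means, and `Emin` should agree with `Eref` up to the discretization error of the grid.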


Example 2.5 MAP classifier for the mechanical parts application 
Figure 2.5(c) shows the decision function of the MAP classifier. The
error rate for this classifier is 4.8%, whereas that of the Bayes
classifier in Figure 2.5(a) is 5.3%. In Figure 2.5(c) four objects are
misclassified. In Figure 2.5(a) that number is five. Thus, with respect
to error rate, the MAP classifier is more effective than the
Bayes classifier of Figure 2.5(a). On the other hand, the overall risk of
the classifier shown in Figure 2.5(c), with the cost function given
in Table 2.2, is −$0.084, which is slightly worse than
the −$0.092 of Figure 2.5(a).


2.1.2 Normal distributed measurements; linear and quadratic 
classifiers 


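A further development of Bayes classification with uniform cost function
requires the specification of the conditional probability densities. This
section discusses the case in which these densities are modelled as
normal. Suppose that the measurement vectors coming from an object
with class \(\omega_k\) are normally distributed with expectation vector
\(\mu_k\) and covariance matrix \(C_k\) (see Appendix C.3):

\[ p(z|\omega_k) = \frac{1}{\sqrt{(2\pi)^N |C_k|}} \exp\left( -\frac{(z-\mu_k)^T C_k^{-1} (z-\mu_k)}{2} \right) \tag{2.17} \]

where N is the dimension of the measurement vector.
Substitution of (2.17) in (2.12) gives the following minimum error rate
classification:

\[ \hat{\omega}(z) = \omega_i \quad\text{with}\quad i = \operatorname*{argmax}_{k=1,\ldots,K} \left\{ \frac{1}{\sqrt{(2\pi)^N |C_k|}} \exp\left( -\frac{(z-\mu_k)^T C_k^{-1} (z-\mu_k)}{2} \right) P(\omega_k) \right\} \tag{2.18} \]

We can take the logarithm of the function between braces without
changing the result of the argmax{ } function. Furthermore, all terms
not containing k are irrelevant. Therefore (2.18) is equivalent to

\[ \hat{\omega}(z) = \omega_i \quad\text{with}\quad i = \operatorname*{argmax}_{k=1,\ldots,K} \left\{ -\ln|C_k| + 2\ln P(\omega_k) - (z-\mu_k)^T C_k^{-1} (z-\mu_k) \right\} \tag{2.19} \]

Hence, the expression of a minimum error rate classification with
normally distributed measurement vectors takes the form of:

\[ \hat{\omega}(z) = \omega_i \quad\text{with}\quad i = \operatorname*{argmax}_{k=1,\ldots,K} \left\{ w_k + z^T \mathbf{w}_k + z^T \mathbf{W}_k z \right\} \tag{2.20} \]

with:

\[ \begin{aligned}
w_k &= -\ln|C_k| + 2\ln P(\omega_k) - \mu_k^T C_k^{-1} \mu_k \\
\mathbf{w}_k &= 2 C_k^{-1} \mu_k \\
\mathbf{W}_k &= -C_k^{-1}
\end{aligned} \tag{2.21} \]

Here \(w_k\) is a scalar, \(\mathbf{w}_k\) a vector and \(\mathbf{W}_k\) a matrix.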
A classifier according to (2.20) is called a quadratic classifier and the
decision function is a quadratic decision function. The boundaries
between the compartments of such a decision function are pieces of
quadratic hypersurfaces in the N-dimensional space. To see this, it
suffices to examine the boundary between the compartments of two
different classes, e.g. \(\omega_i\) and \(\omega_j\). According to (2.20)
the boundary between the compartments of these two classes must satisfy
the following equation:

\[ w_i + z^T \mathbf{w}_i + z^T \mathbf{W}_i z = w_j + z^T \mathbf{w}_j + z^T \mathbf{W}_j z \tag{2.22} \]

or:

\[ w_i - w_j + z^T (\mathbf{w}_i - \mathbf{w}_j) + z^T (\mathbf{W}_i - \mathbf{W}_j) z = 0 \tag{2.23} \]


Equation (2.23) is quadratic in z. If the sensory system has
only two sensors, i.e. N = 2, then the solution of (2.23) is a quadratic
curve in the measurement space (an ellipse, a parabola, a hyperbola, or
a degenerate case: a circle, a straight line, or a pair of lines). Examples
will follow in subsequent sections. If we have three sensors, N = 3, then
the solution of (2.23) is a quadratic surface (ellipsoid, paraboloid,
hyperboloid, etc.). If N > 3, the solutions are hyperquadrics
(hyperellipsoids, etc.).

If the number of classes is more than two, K > 2, then (2.23) is a 
necessary condition for the boundaries between compartments, but not 
a sufficient one. This is because the boundary between two classes may be 
intersected by a compartment of a third class. Thus, only pieces of the 
surfaces found by (2.23) are part of the boundary. The pieces of the sur- 
face that are part of the boundary are called decision boundaries. The 
assignment of a class to a vector exactly on the decision boundary is 
ambiguous. The class assigned to such a vector can be arbitrarily selected 
from the classes involved.

As an example we consider the classifications shown in Figure 2.5.
In fact, the probability densities shown in Figure 2.4(b) are normal. 
Therefore, the decision boundaries shown in Figure 2.5 must be quad- 
ratic curves. 
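The equivalence between the log-density form (2.19) and the quadratic form (2.20) with the parameters (2.21) can be verified numerically. The sketch below uses plain MATLAB; the two classes, their expectation vectors, covariance matrices, priors and the test vector `z` are arbitrary illustrative values, not taken from the mechanical parts data.

```matlab
% Hypothetical two-class, two-dimensional Gaussian model
mu    = {[0.2; 0.3], [0.6; 0.5]};                            % expectation vectors
C     = {[0.02 0.01; 0.01 0.03], [0.04 -0.01; -0.01 0.02]};  % covariance matrices
prior = [0.4 0.6];                                           % prior probabilities
z     = [0.35; 0.45];                                        % measurement vector

K  = numel(mu);
g1 = zeros(1, K);     % argument of the argmax in (2.19)
g2 = zeros(1, K);     % argument of the argmax in (2.20), using (2.21)
for k = 1:K
    Cinv = inv(C{k});
    d    = z - mu{k};
    g1(k) = -log(det(C{k})) + 2*log(prior(k)) - d' * Cinv * d;

    w0 = -log(det(C{k})) + 2*log(prior(k)) - mu{k}' * Cinv * mu{k};  % scalar w_k
    w  = 2 * Cinv * mu{k};                                           % vector w_k
    W  = -Cinv;                                                      % matrix W_k
    g2(k) = w0 + z' * w + z' * W * z;
end

[gmax, i] = max(g2);  % index of the assigned class
```

Both forms should give the same discriminant values up to rounding, so the assigned class does not depend on which form is evaluated.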


Class-independent covariance matrices 


In this subsection, we discuss the case in which the covariance matrices
do not depend on the classes, i.e. \(C_k = C\) for all \(\omega_k \in \Omega\).
This situation occurs when the measurement vector of an object equals the
(class-dependent) expectation vector corrupted by sensor noise, that is
\(z = \mu_k + n\). The noise n is assumed to be class-independent with
covariance matrix C. Hence, the class information is brought forth by the
expectation vectors only.

The term \(-\ln|C|\) then no longer depends on k, and the quadratic
decision function of (2.19) degenerates into:

\[ \hat{\omega}(z) = \omega_i \quad\text{with}\quad i = \operatorname*{argmax}_{k=1,\ldots,K} \left\{ 2\ln P(\omega_k) - (z-\mu_k)^T C^{-1} (z-\mu_k) \right\} \tag{2.24} \]

Since the covariance matrix C is self-adjoint and positive definite
(Appendix B.5) the quantity \((z-\mu_k)^T C^{-1} (z-\mu_k)\) can be
regarded as a distance measure between the vector z and the expectation
vector \(\mu_k\). The measure is called the squared Mahalanobis distance.
The function of (2.24) decides for the class whose expectation vector is
nearest to the observed measurement vector (with a correction factor
\(-2\ln P(\omega_k)\) to account for prior knowledge). Hence the name
minimum Mahalanobis distance classifier.

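The decision boundaries between compartments in the measurement
space are linear (hyper)planes. This follows from (2.20) and (2.21):
since \(\mathbf{W}_k = -C^{-1}\) is the same for all classes, the
quadratic term can be dropped, so that

\[ \hat{\omega}(z) = \omega_i \quad\text{with}\quad i = \operatorname*{argmax}_{k=1,\ldots,K} \left\{ w_k + z^T \mathbf{w}_k \right\} \tag{2.25} \]

where:

\[ \begin{aligned}
w_k &= 2\ln P(\omega_k) - \mu_k^T C^{-1} \mu_k \\
\mathbf{w}_k &= 2 C^{-1} \mu_k
\end{aligned} \tag{2.26} \]

A decision function which has the form of (2.25) is linear. The
corresponding classifier is called a linear classifier. The equations of the
decision boundaries are \(w_i - w_j + z^T (\mathbf{w}_i - \mathbf{w}_j) = 0\).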
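As a cross-check of (2.24) to (2.26), the following plain-MATLAB sketch classifies one measurement vector both by minimum Mahalanobis distance and by the linear discriminant. It reuses the expectation vectors and the covariance matrix that Listing 2.3 uses to generate data; the equal priors and the test vector `z` are assumptions made for this illustration.

```matlab
% Expectation vectors (one per row) and common covariance matrix,
% as used for data generation in Listing 2.3
mus = [0.2 0.3; 0.35 0.75; 0.65 0.55; 0.8 0.25];
C   = [0.018 0.007; 0.007 0.011];
prior = [0.25 0.25 0.25 0.25];      % assumed equal priors
z     = [0.5; 0.6];                 % hypothetical measurement vector

K    = size(mus, 1);
Cinv = inv(C);
gmah = zeros(1, K);                 % argument of the argmax in (2.24)
glin = zeros(1, K);                 % argument of the argmax in (2.25)
for k = 1:K
    m = mus(k, :)';
    gmah(k) = 2*log(prior(k)) - (z - m)' * Cinv * (z - m);
    w0 = 2*log(prior(k)) - m' * Cinv * m;      % scalar term of (2.26)
    w  = 2 * Cinv * m;                         % vector term of (2.26)
    glin(k) = w0 + z' * w;
end
[gm, imah] = max(gmah);             % minimum Mahalanobis distance decision
[gl, ilin] = max(glin);             % linear classifier decision

% The two discriminants differ only by the class-independent term z'*inv(C)*z
offset = z' * Cinv * z;
```

Since `offset` is the same for every class, it does not affect the argmax, which is why the quadratic classifier reduces to a linear one here.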

Figure 2.7 gives an example of a four-class problem (K = 4) in a
two-dimensional measurement space (N = 2). A scatter diagram with the
contour plots of the conditional probability densities is given (Figure
2.7(a)), together with the compartments of the minimum Mahalanobis
distance classifier (Figure 2.7(b)). These figures were generated by the
code in Listing 2.3.


Listing 2.3
PRTools code for minimum Mahalanobis distance classification

    mus = [0.2 0.3; 0.35 0.75; 0.65 0.55; 0.8 0.25];
    C = [0.018 0.007; 0.007 0.011]; z = gauss(200,mus,C);
    w = ldc(z);           % Normal densities, identical covariances
    figure(1); scatterd(z); hold on; plotm(w);
    figure(2); scatterd(z); hold on; plotc(w);

Figure 2.7 Minimum Mahalanobis distance classification. (a) Scatter diagram with
contour plot of the conditional probability densities. (b) Decision boundaries


Minimum distance classification


A further simplification is possible when the measurement vector equals
the class-dependent vector \(\mu_k\) corrupted by class-independent white
noise with covariance matrix \(C = \sigma^2 I\). In that case, (2.24)
becomes:

\[ \hat{\omega}(z) = \omega_i \quad\text{with}\quad i = \operatorname*{argmin}_{k=1,\ldots,K} \left\{ -2\ln P(\omega_k) + \frac{\|z-\mu_k\|^2}{\sigma^2} \right\} \tag{2.27} \]

The quantity \(\|z-\mu_k\|\) is the normal (Euclidean) distance between z
and \(\mu_k\). The classifier corresponding to (2.27) decides for the class
whose expectation vector is nearest to the observed measurement vector
(with a correction factor \(-2\sigma^2 \ln P(\omega_k)\) to account for the
prior knowledge). Hence the name minimum distance classifier. As with the
minimum Mahalanobis distance classifier, the decision boundaries between
compartments are linear (hyper)planes. The plane separating the
compartments of two classes \(\omega_i\) and \(\omega_j\) is given by:

\[ \sigma^2 \ln\frac{P(\omega_i)}{P(\omega_j)} - \frac{1}{2}\left( \|\mu_i\|^2 - \|\mu_j\|^2 \right) + z^T (\mu_i - \mu_j) = 0 \tag{2.28} \]


The solution of this equation is a plane perpendicular to the line segment
connecting \(\mu_i\) and \(\mu_j\). The location of the hyperplane depends
on the factor \(\sigma^2 \ln(P(\omega_i)/P(\omega_j))\). If
\(P(\omega_i) = P(\omega_j)\), the hyperplane is the perpendicular
bisector of the line segment (see Figure 2.8).
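For completeness, the step from (2.27) to (2.28) can be written out as follows. This is a sketch of the algebra, using only the definitions above: the boundary between the compartments of \(\omega_i\) and \(\omega_j\) is the set of points where both classes attain the same value of the expression minimized in (2.27).

```latex
\begin{align*}
-2\ln P(\omega_i) + \frac{\|z-\mu_i\|^2}{\sigma^2}
  &= -2\ln P(\omega_j) + \frac{\|z-\mu_j\|^2}{\sigma^2} \\
\|z-\mu_j\|^2 - \|z-\mu_i\|^2 + 2\sigma^2\ln\frac{P(\omega_i)}{P(\omega_j)} &= 0 \\
2 z^{T}(\mu_i-\mu_j) - \left(\|\mu_i\|^2 - \|\mu_j\|^2\right)
  + 2\sigma^2\ln\frac{P(\omega_i)}{P(\omega_j)} &= 0
\end{align*}
```

Dividing the last line by two gives the plane equation. With equal priors the logarithm vanishes, and the midpoint \(z = (\mu_i + \mu_j)/2\) satisfies the equation, which is consistent with the perpendicular bisector property.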

Figure 2.9 gives an example of the decision function of the minimum
distance classification. PRTools code to generate these figures is given in
Listing 2.4.

Figure 2.8 Decision boundary of a minimum distance classifier

Figure 2.9 Minimum distance classification. (a) Scatter diagram with contour plot
of the conditional probability densities. (b) Decision boundaries


Listing 2.4
PRTools code for minimum distance classification

    mus = [0.2 0.3; 0.35 0.75; 0.65 0.55; 0.8 0.25];
    C = 0.01*eye(2); z = gauss(200,mus,C);
    % Normal densities, uncorrelated noise with equal variances
    w = nmsc(z);
    figure(1); scatterd(z); hold on; plotm(w);
    figure(2); scatterd(z); hold on; plotc(w);


Class-independent expectation vectors 


Another interesting situation is when the class information is solely
brought forth by the differences between covariance matrices. In that
case, the expectation vectors do not depend on the class: \(\mu_k = \mu\)
for all k. Hence, the central parts of the conditional probability densities
overlap. In the vicinity of the expectation vector, the probability of making
a wrong decision is always largest. The decision function takes the form of:

\[ \hat{\omega}(z) = \omega_i \quad\text{with}\quad i = \operatorname*{argmax}_{k=1,\ldots,K} \left\{ -\ln|C_k| + 2\ln P(\omega_k) - (z-\mu)^T C_k^{-1} (z-\mu) \right\} \tag{2.29} \]

Figure 2.10 Classification of objects with equal expectation vectors. (a) Rotational
symmetric conditional probability densities. (b) Conditional probability densities
with different orientations; see text

If the covariance matrices are of the type \(\sigma_k^2 I\), the decision
boundaries are concentric circles or (hyper)spheres. Figure 2.10(a) gives an
example of such a situation. If the covariance matrices are rotated versions
of one prototype, the decision boundaries are hyperbolae. If the prior
probabilities are equal, these hyperbolae degenerate into a number of
planes (or, if N = 2, straight lines). An example is given in Figure 2.10(b).
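The radius of a circular boundary of this kind follows directly from (2.29). As a sketch for two classes with \(C_1 = \sigma_1^2 I\) and \(C_2 = \sigma_2^2 I\) (so that \(\ln|C_k| = N\ln\sigma_k^2\)), equating the two arguments of the argmax at squared distance \(r^2 = \|z-\mu\|^2\) gives:

```latex
\begin{align*}
-N\ln\sigma_1^2 + 2\ln P(\omega_1) - \frac{r^2}{\sigma_1^2}
  &= -N\ln\sigma_2^2 + 2\ln P(\omega_2) - \frac{r^2}{\sigma_2^2} \\
r^2\left(\frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2}\right)
  &= N\ln\frac{\sigma_2^2}{\sigma_1^2} + 2\ln\frac{P(\omega_1)}{P(\omega_2)}
\end{align*}
```

With equal priors and \(\sigma_1 < \sigma_2\), both sides are positive, so the boundary is a (hyper)sphere of radius r around \(\mu\); the class with the smaller variance is assigned inside it and the other class outside.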


2.2 REJECTION 


Sometimes, it is advantageous to provide the classification with a 
so-called reject option. In some applications, an erroneous decision may 
lead to very high cost, or even to a hazardous situation. Suppose that the 
measurement vector of a given object is in the vicinity of the decision 
boundary. The measurement vector does not provide much class infor- 
mation then. Therefore, it might be beneficial to postpone the classifica- 
tion of that particular object. Instead of classifying the object, we reject 
the classification. On rejection of a classification we have to classify the 
object either manually, or by means of more involved techniques (for 
instance, by bringing in more sensors or more advanced classifiers). 
We may take the reject option into account by extending the range of
the decision function by a new element: the rejection class \(\omega_0\).
The range of the decision function becomes:
\(\Omega^+ = \{\omega_0, \omega_1, \ldots, \omega_K\}\). The decision
function itself is a mapping \(\hat{\omega}(z): \mathbb{R}^N \to \Omega^+\).
In order to develop a Bayes classifier, the definition of the cost function
must also be extended:
\(C(\hat{\omega}|\omega): \Omega^+ \times \Omega \to \mathbb{R}\). The
definition is extended so as to express the cost of rejection.
\(C(\omega_0|\omega_k)\) is the cost of rejection while the true class of
the object is \(\omega_k\).

With these extensions, the decision function of a Bayes classifier
becomes (compare with (2.8)):

\[ \hat{\omega}_{BAYES}(z) = \operatorname*{argmin}_{\omega\in\Omega^+} \left\{ \sum_{k=1}^{K} C(\omega|\omega_k)\,P(\omega_k|z) \right\} \tag{2.30} \]

The further development of the classifier follows the same course as in (2.8).


2.2.1 Minimum error rate classification with reject option 


The minimum error rate classifier can also be extended with a reject
option. Suppose that the cost of a rejection is \(C_{rej}\) regardless of
the true class of the object. All other costs are uniform and defined by (2.9).

We first note that if the reject option is chosen, the risk is \(C_{rej}\).
If it is not, the minimal conditional risk is the \(e_{min}(z)\) given by
(2.15). Choosing the smaller of \(C_{rej}\) and \(e_{min}(z)\) yields the
following optimal decision function:

\[ \hat{\omega}(z) = \begin{cases} \omega_0 & \text{if } C_{rej} < e_{min}(z) \\ \operatorname*{argmax}_{\omega\in\Omega} \{P(\omega|z)\} & \text{otherwise} \end{cases} \tag{2.31} \]

The maximum posterior probability \(\max\{P(\omega|z)\}\) is always greater
than or equal to 1/K. Therefore, the minimal conditional error probability is
bounded by (1 − 1/K). Consequently, in (2.31) the reject option never
wins if \(C_{rej} > 1 - 1/K\).
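The decision rule (2.31) is short enough to sketch in plain MATLAB. In the example below, the posterior probabilities (one row per measurement) and the rejection cost `Crej` are made-up values; the rejection class \(\omega_0\) is encoded as the index 0, which is a convention chosen for this sketch only.

```matlab
% Hypothetical posterior probabilities P(w_k|z), one row per measurement
post = [0.90 0.05 0.05 ;
        0.40 0.35 0.25 ;
        1/3  1/3  1/3 ];
Crej = 0.5;                          % assumed cost of a rejection

K = size(post, 2);
[pmax, imax] = max(post, [], 2);     % MAP class and its posterior probability
emin = 1 - pmax;                     % minimal conditional error, cf. (2.15)

decision = imax;                     % default: MAP assignment
decision(Crej < emin) = 0;           % 0 encodes the rejection class, cf. (2.31)

% The minimal conditional error never exceeds 1 - 1/K
bound = 1 - 1/K;
withinBound = all(emin <= bound + 1e-12);
```

Here the first measurement is classified, while the second and third are rejected because their minimal conditional error exceeds `Crej`. With a rejection cost above `bound`, no measurement would ever be rejected.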

The overall probability of having a rejection is called the reject rate.
It is found by calculating the fraction of measurements that fall inside
the reject region:

\[ \text{Rej-Rate} = \int_{\{z \,|\, C_{rej} < e_{min}(z)\}} p(z)\,dz \tag{2.32} \]

The integral extends over those regions in the measurement space for
which \(C_{rej} < e_{min}(z)\). The error rate is found by averaging the
conditional error over all measurements except those that fall inside the
reject region:

\[ E_{min} = \int_{\{z \,|\, C_{rej} \ge e_{min}(z)\}} e_{min}(z)\,p(z)\,dz \tag{2.33} \]

Comparison of (2.33) with (2.16) shows that the error rate of a
classification with reject option is bounded by the error rate of a
classification without reject option.


Example 2.6 The reject option in the mechanical parts application 
In the classification of bolts, nuts, rings and so on, discussed in the 
previous examples, it might be advantageous to manually inspect 
those parts whose automatic classification is likely to fail. We 
assume that the cost of manual inspection is about $0.04. Table 
2.3 tabulates the cost function with the reject option included (com- 
pare with Table 2.2). 

The corresponding classification map is shown in Figure 2.11. In
this example, the reject option is advantageous only between the
regions of the rings and the nuts. The overall risk decreases from
−$0.092 per classification to −$0.093 per classification. The benefit
of the reject option is only marginal because the scrap is an expensive
item when offered to manual inspection. In fact, the assignment of an
object to the scrap class is a good alternative for the reject option.


Listing 2.5 shows the actual implementation in MATLAB. Clearly it is very 
similar to the implementation for the classification including the costs. To 
incorporate the reject option, not only the cost matrix has to be extended, 
but clabels has to be redefined as well. When these labels are not 
supplied explicitly, they are copied from the data set. In the reject case, an 
extra class is introduced, so the definition of the labels cannot be avoided. 


Table 2.3 Cost function of the mechanical part application with the reject option
included

C(ω̂_i|ω_k) in $                      True class
Assigned class         ω_1 = bolt   ω_2 = nut   ω_3 = ring   ω_4 = scrap
ω̂ = bolt                 −0.20        0.07        0.07         0.07
ω̂ = nut                   0.07       −0.15        0.07         0.07
ω̂ = ring                  0.07        0.07       −0.05         0.07
ω̂ = scrap                 0.03        0.03        0.03         0.03
ω̂ = ω_0 = rejection      −0.16       −0.11        0.01         0.07

Figure 2.11 Bayes classification with the reject option included


Listing 2.5
PRTools code for minimum risk classification including a reject option

    load nutsbolts;
    cost = [ -0.20  0.07  0.07  0.07 ; ...
              0.07 -0.15  0.07  0.07 ; ...
              0.07  0.07 -0.05  0.07 ; ...
              0.03  0.03  0.03  0.03 ; ...
             -0.16 -0.11  0.01  0.07 ];
    clabels = str2mat(getlablist(z),'reject');
    w1 = qdc(z);                      % Estimate a single Gaussian per class
                                      % Change output according to cost
    w2 = w1*classc*costm([],cost',clabels);
    scatterd(z);
    plotc(w1);                        % Plot without using cost
    plotc(w2);                        % Plot using cost


2.3 DETECTION: THE TWO-CLASS CASE 


The detection problem is a classification problem with two possible
classes: K = 2. In this special case, the Bayes decision rule can be
moulded into a simple form. Assuming a uniform cost function, the
MAP classifier, expressed in (2.12), reduces to the following test:

\[ p(z|\omega_1)\,P(\omega_1) > p(z|\omega_2)\,P(\omega_2) \tag{2.34} \]

If the test fails, it is decided for \(\omega_2\), otherwise for
\(\omega_1\). We write symbolically:

\[ p(z|\omega_1)\,P(\omega_1) \underset{\omega_2}{\overset{\omega_1}{\gtrless}} p(z|\omega_2)\,P(\omega_2) \tag{2.35} \]

Rearrangement gives:

\[ \frac{p(z|\omega_1)}{p(z|\omega_2)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{P(\omega_2)}{P(\omega_1)} \tag{2.36} \]

Regarded as a function of \(\omega_k\), the conditional probability density
\(p(z|\omega_k)\) is called the likelihood function of \(\omega_k\).
Therefore, the ratio:

\[ L(z) = \frac{p(z|\omega_1)}{p(z|\omega_2)} \tag{2.37} \]

is called the likelihood ratio. With this definition the classification
becomes a simple likelihood ratio test:

\[ L(z) \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{P(\omega_2)}{P(\omega_1)} \tag{2.38} \]

The test is equivalent to a threshold operation applied to L(z) with
threshold \(P(\omega_2)/P(\omega_1)\).

Even if the cost function is not uniform, the Bayes detector retains the 
structure of (2.38), only the threshold should be adapted so as to reflect 
the change of cost. The proof of this is left as an exercise for the reader. 
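One possible route for this exercise is sketched below. It uses the shorthand \(C_{ij} = C(\hat{\omega}_i|\omega_j)\), which is introduced here for brevity and is not part of the notation above, and it assumes \(C_{21} > C_{11}\), i.e. that a wrong decision for an object of class \(\omega_1\) costs more than a correct one. Class \(\omega_1\) is chosen when its conditional risk (2.4) is the smaller of the two:

```latex
\begin{align*}
C_{11}P(\omega_1|z) + C_{12}P(\omega_2|z)
  &< C_{21}P(\omega_1|z) + C_{22}P(\omega_2|z) \\
(C_{12} - C_{22})\,p(z|\omega_2)P(\omega_2)
  &< (C_{21} - C_{11})\,p(z|\omega_1)P(\omega_1) \\
L(z) = \frac{p(z|\omega_1)}{p(z|\omega_2)}
  &> \frac{(C_{12} - C_{22})\,P(\omega_2)}{(C_{21} - C_{11})\,P(\omega_1)}
\end{align*}
```

The second line follows from Bayes' theorem (2.3) after multiplying both sides by the positive quantity p(z). For a uniform cost function, \(C_{12} = C_{21} = 1\) and \(C_{11} = C_{22} = 0\), the threshold reduces to \(P(\omega_2)/P(\omega_1)\), as in the likelihood ratio test above.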

In case of measurement vectors with normal distributions, it is
convenient to replace the likelihood ratio test with a so-called
log-likelihood ratio test:

\[ \Lambda(z) \underset{\omega_2}{\overset{\omega_1}{\gtrless}} T \quad\text{with}\quad \Lambda(z) = \ln L(z) \quad\text{and}\quad T = \ln\left( \frac{P(\omega_2)}{P(\omega_1)} \right) \tag{2.39} \]

For vectors drawn from normal distributions, the log-likelihood ratio is:

\[ \Lambda(z) = -\frac{1}{2}\left( \ln|C_1| - \ln|C_2| + (z-\mu_1)^T C_1^{-1} (z-\mu_1) - (z-\mu_2)^T C_2^{-1} (z-\mu_2) \right) \tag{2.40} \]

which is much simpler to evaluate than the likelihood ratio. When the
covariance matrices of both classes are equal (\(C_1 = C_2 = C\)) the
log-likelihood ratio simplifies to:

\[ \Lambda(z) = \left( z - \frac{1}{2}(\mu_1 + \mu_2) \right)^T C^{-1} (\mu_1 - \mu_2) \tag{2.41} \]
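The simplification from (2.40) to (2.41) can be checked numerically. The sketch below, in plain MATLAB, evaluates both expressions for a hypothetical two-class Gaussian model with a common covariance matrix and then applies the threshold test (2.39). All numerical values (expectation vectors, covariance matrix, priors and the measurement `z`) are assumptions chosen for illustration.

```matlab
% Hypothetical two-class Gaussian model with a common covariance matrix
mu1 = [1; 2];
mu2 = [3; 1];
C   = [2.0 0.5; 0.5 1.0];
prior = [0.7 0.3];                  % assumed P(w_1), P(w_2)
z   = [1.5; 1.0];                   % hypothetical measurement vector

Cinv = inv(C);
% General form (2.40) with C_1 = C_2 = C (the log-determinants cancel)
lam40 = -0.5 * ( log(det(C)) - log(det(C)) ...
               + (z - mu1)' * Cinv * (z - mu1) ...
               - (z - mu2)' * Cinv * (z - mu2) );
% Simplified linear form (2.41)
lam41 = (z - 0.5*(mu1 + mu2))' * Cinv * (mu1 - mu2);

T = log(prior(2) / prior(1));       % threshold of (2.39)
if lam41 > T
    decision = 1;                   % decide for w_1
else
    decision = 2;                   % decide for w_2
end
```

Both expressions should agree up to rounding. For this measurement the log-likelihood ratio exceeds the threshold, so the sketch decides for the first class.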


Two types of errors are involved in a detection system. Suppose that
\(\hat{\omega}(z)\) is the result of a decision based on the measurement z.
The true (but unknown) class \(\omega\) of an object is either \(\omega_1\)
or \(\omega_2\). Then the following four states may occur:

                     ω = ω_1                ω = ω_2
ω̂(z) = ω_1     correct decision I      type II error
ω̂(z) = ω_2     type I error            correct decision II

Often, a detector is associated with a device that decides whether an
object is present (\(\omega = \omega_2\)) or not (\(\omega = \omega_1\)),
or whether an event occurs or not. These types of problems arise, for
instance, in radar systems, medical diagnostic systems, burglar alarms, etc.
Usually, the nomenclature for the four states is then as follows:

                     ω = ω_1                            ω = ω_2
ω̂(z) = ω_1     true negative                      missed event or false negative
ω̂(z) = ω_2     false alarm or false positive      detection (or hit) or true positive

Sometimes, the true negative is called ‘rejection’. However, we have
reserved this term for Section 2.2, where it has a different denotation.
The probabilities of the two types of errors, i.e. the false alarm and the
missed event, are performance measures of the detector. Usually these
probabilities are given conditionally with respect to the true classes, i.e.
$P_{miss} \overset{def}{=} P(\hat{\omega}_1|\omega_2)$ and $P_{fa} \overset{def}{=} P(\hat{\omega}_2|\omega_1)$. In addition, we may define the
probability of a detection $P_{det} \overset{def}{=} P(\hat{\omega}_2|\omega_2)$.

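The overall probability of a false alarm can be derived from the prior
probability using Bayes’ theorem, e.g. $P(\hat{\omega}_2, \omega_1) = P(\hat{\omega}_2|\omega_1)P(\omega_1) = P_{fa}P(\omega_1)$.
The probabilities $P_{miss}$ and $P_{fa}$, as a function of the threshold
T, follow from (2.39):

$$ \begin{aligned}
P_{fa}(T) &= P(\Lambda(\mathbf{z}) < T \,|\, \omega_1) = \int_{-\infty}^{T} p(\Lambda|\omega_1)\, d\Lambda \\
P_{miss}(T) &= P(\Lambda(\mathbf{z}) > T \,|\, \omega_2) = \int_{T}^{\infty} p(\Lambda|\omega_2)\, d\Lambda \\
P_{det}(T) &= 1 - P_{miss}(T)
\end{aligned} \tag{2.42} $$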


In general, it is difficult to find analytical expressions for $P_{miss}(T)$ and
$P_{fa}(T)$. In the case of Gaussian distributed measurement vectors, with
$\mathbf{C}_1 = \mathbf{C}_2 = \mathbf{C}$, expression (2.42) can be further developed. Equation
(2.41) shows that $\Lambda(\mathbf{z})$ is linear in $\mathbf{z}$. Since $\mathbf{z}$ has a normal distribution,
so has $\Lambda(\mathbf{z})$; see Appendix C.3.1. The conditional distribution of $\Lambda(\mathbf{z})$,
given the class, is fully specified by its conditional expectation and its
variance. As $\Lambda(\mathbf{z})$ is linear in $\mathbf{z}$, these parameters are obtained as:


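$$ \begin{aligned}
E[\Lambda(\mathbf{z})|\omega_1] &= \left(E[\mathbf{z}|\omega_1] - \frac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)\right)^T \mathbf{C}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \\
&= \left(\boldsymbol{\mu}_1 - \frac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)\right)^T \mathbf{C}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \\
&= \frac{1}{2}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T \mathbf{C}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)
\end{aligned} \tag{2.43} $$

Likewise:

$$ E[\Lambda(\mathbf{z})|\omega_2] = -\frac{1}{2}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T \mathbf{C}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \tag{2.44} $$

and:

$$ \mathrm{Var}[\Lambda(\mathbf{z})|\omega_1] = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T \mathbf{C}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) = \mathrm{Var}[\Lambda(\mathbf{z})|\omega_2] \tag{2.45} $$

With that, the signal-to-noise ratio is:

$$ SNR = \frac{\left(E[\Lambda|\omega_2] - E[\Lambda|\omega_1]\right)^2}{\mathrm{Var}[\Lambda|\omega_2]} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T \mathbf{C}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \tag{2.46} $$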
The quantity $(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T \mathbf{C}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ is the squared Mahalanobis
distance between $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ with respect to $\mathbf{C}$. The square root, $d \overset{def}{=} \sqrt{SNR}$,
is called the discriminability of the detector. It is the signal-to-noise ratio
expressed as an amplitude ratio.

The conditional probability densities of A are shown in Figure 2.12. 
The two overlapping areas in this figure are the probabilities of false 
alarm and missed event. Clearly, these areas decrease as d increases. 
Therefore, d is a good indicator of the performance of the detector. 

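Knowing that the conditional probabilities are Gaussian, it is possible
to evaluate the expressions for $P_{miss}(T)$ and $P_{fa}(T)$ in (2.42) analytically.
The distribution function of a Gaussian random variable is given in
terms of the error function erf():

$$ \begin{aligned}
P_{fa}(T) &= \frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\left(\frac{T - \frac{1}{2}d^2}{d\sqrt{2}}\right) \\
P_{miss}(T) &= \frac{1}{2} - \frac{1}{2}\,\mathrm{erf}\left(\frac{T + \frac{1}{2}d^2}{d\sqrt{2}}\right)
\end{aligned} \tag{2.47} $$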


Figure 2.13(a) shows a graph of $P_{miss}$, $P_{fa}$ and $P_{det} = 1 - P_{miss}$ when the
threshold T varies. It can be seen that the requirements for T are
contradictory. The probability of a false alarm (type I error) is small if the
threshold is chosen small. However, the probability of a missed event
(type II error) is small if the threshold is chosen large. A trade-off must
be found between both types of errors.

Figure 2.12 The conditional probability densities of the log-likelihood ratio in the
Gaussian case with $\mathbf{C}_1 = \mathbf{C}_2 = \mathbf{C}$

Figure 2.13 Performance of a detector in the Gaussian case with equal covariance
matrices. (a) $P_{miss}$, $P_{det}$ and $P_{fa}$ versus the threshold T. (b) $P_{det}$ versus $P_{fa}$ as a
parametric plot of T

The trade-off can be made explicitly visible by means of a parametric
plot of $P_{det}$ versus $P_{fa}$ with varying T. Such a curve is called a receiver
operating characteristic curve (ROC curve). Ideally, $P_{fa} = 0$ and
$P_{det} = 1$, but the figure shows that no threshold exists for which this
occurs. Figure 2.13(b) shows the ROC curve for a Gaussian case with
equal covariance matrices. Here, the ROC curve can be obtained
analytically, but in most other cases the ROC curve of a given detector must
be obtained either empirically or by numerical integration of (2.42).
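For the Gaussian case with equal covariance matrices, the analytical ROC curve can be traced by sweeping the threshold T through (2.47). The Python sketch below does this with the standard library only; the function names, the threshold range and the number of steps are illustrative assumptions, and d is set to the square root of 8 purely as an example value.

```python
import math

def p_fa(t, d):
    """False-alarm probability, first line of (2.47)."""
    return 0.5 + 0.5 * math.erf((t - 0.5 * d * d) / (d * math.sqrt(2.0)))

def p_miss(t, d):
    """Missed-event probability, second line of (2.47)."""
    return 0.5 - 0.5 * math.erf((t + 0.5 * d * d) / (d * math.sqrt(2.0)))

def roc(d, t_min=-20.0, t_max=20.0, steps=81):
    """Return (P_fa, P_det) pairs for thresholds from t_min up to t_max."""
    points = []
    for k in range(steps):
        t = t_min + (t_max - t_min) * k / (steps - 1)
        points.append((p_fa(t, d), 1.0 - p_miss(t, d)))
    return points

d = math.sqrt(8.0)      # example discriminability
curve = roc(d)
for pfa, pdet in curve[::10]:
    print(f"P_fa = {pfa:.4f}   P_det = {pdet:.4f}")
```

Each point lies on or above the diagonal $P_{det} = P_{fa}$, and both coordinates grow with T, which reflects the trade-off between the two error types described above.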

Listing 2.6 shows the MATLAB implementation for the computation of the
ROC curve. To avoid confusion about the roles of the different
classes (which class should be considered positive and which negative),
the ROC curve in PRTools shows the fraction of false positives against
the fraction of false negatives. This means that the resulting curve is a
vertically mirrored version of Figure 2.13(b). Note also that in the
listing the training set is used both to train the classifier and to
generate the curve. For a reliable estimate, an independent data set
should be used for the estimation of the ROC curve.


Listing 2.6 
PRTools code for estimation of a ROC curve 


z=gendats(100,1,2);     % Generate a 1D dataset
w=qdc(z);               % Train a classifier
r=roc(z*w);             % Compute the ROC curve
plotr(r);               % Plot it


The merit of a ROC curve is that it specifies the intrinsic ability of the 
detector to discriminate between the two classes. In other words, 
the ROC curve of a detector depends neither on the cost function of 
the application nor on the prior probabilities. 

Since a false alarm and a missed event are mutually exclusive, the error 
rate of the classification is the sum of both probabilities: 


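$$ \begin{aligned}
E &= P(\hat{\omega}_2, \omega_1) + P(\hat{\omega}_1, \omega_2) \\
&= P(\hat{\omega}_2|\omega_1)P(\omega_1) + P(\hat{\omega}_1|\omega_2)P(\omega_2) \\
&= P_{fa}P(\omega_1) + P_{miss}P(\omega_2)
\end{aligned} \tag{2.48} $$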


In the example of Figure 2.13, the discriminability d equals $\sqrt{8}$. If this
indicator becomes larger, $P_{miss}$ and $P_{fa}$ become smaller. Hence, the error
rate E is a monotonically decreasing function of d.
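The dependence of the error rate on d can be checked numerically by combining (2.47) and (2.48). The sketch below is a minimal Python illustration under two assumptions that are not fixed by the text: the threshold is set to $T = \ln(P(\omega_2)/P(\omega_1))$ as in (2.39), and the prior probabilities 0.7 and 0.3 are hypothetical.

```python
import math

def p_fa(t, d):
    """False-alarm probability from (2.47)."""
    return 0.5 + 0.5 * math.erf((t - 0.5 * d * d) / (d * math.sqrt(2.0)))

def p_miss(t, d):
    """Missed-event probability from (2.47)."""
    return 0.5 - 0.5 * math.erf((t + 0.5 * d * d) / (d * math.sqrt(2.0)))

def error_rate(d, prior1, prior2):
    """Error rate (2.48) with the threshold T = ln(P(w2)/P(w1)) of (2.39)."""
    t = math.log(prior2 / prior1)
    return p_fa(t, d) * prior1 + p_miss(t, d) * prior2

prior1, prior2 = 0.7, 0.3                 # hypothetical prior probabilities
d_values = [0.5, 1.0, 2.0, math.sqrt(8.0), 4.0]
rates = [error_rate(d, prior1, prior2) for d in d_values]
for d, e in zip(d_values, rates):
    print(f"d = {d:.3f}   E = {e:.5f}")
```

For small d the error rate approaches the smaller of the two priors (always deciding for the more probable class), and it shrinks towards zero as d grows.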


Example 2.7 Quality inspection of empty bottles 

In the bottling industry, the detection of defects of bottles (to be 
recycled) is relevant in order to assure the quality of the product. A 
variety of flaws can occur: cracks, dirty bottoms, fragments of glass, 
labels, etc. In this example, the problem of detecting defects of the 
mouth of an empty bottle is addressed. This is important, especially in 
the case of bottles with crown caps. Small damages of the mouth may 
cause a non-airtight enclosure of the product which subsequently 
causes an untimely decay. 

The detection of defects at the mouth is a difficult task. Some 
irregularities at the mouth seem to be perturbing, but in fact are 
harmless. Other irregularities (e.g. small intrusions at the surface of 
the mouth) are quite harmful. The inspection system (Figure 2.14) 
that performs the task consists of a stroboscopic, specular ‘light field’ 
illuminator, a digital camera, a detector, an actuator and a sorting 
mechanism. The illumination is such that in the absence of irregular- 
ities at the mouth, the bottle is seen as a bright ring (with fixed size 
and position) on a dark background. Irregularities at the mouth give 
rise to disturbances of the ring. See Figure 2.15. 

Figure 2.14 Quality inspection system for the recycling of bottles

Figure 2.15 Acquired images of two different bottles. (a) Image of the mouth of a
new bottle. (b) Image of the mouth of an older bottle with clearly visible intrusions

The decision of the inspection system is based on a measurement
vector that is extracted from the acquired image. For this purpose the
area of the ring is divided into 256 equally sized sectors. Within each
sector the average of the grey levels of the observed ring is estimated.
These averages (as a function of running arc length along the ring) are
made rotationally invariant by a translation-invariant transformation,
e.g. the amplitude spectrum of the discrete Fourier transform.
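The invariance claimed here follows from the shift property of the discrete Fourier transform: a circular shift of the sector averages only changes the phase of each coefficient, not its magnitude. The Python sketch below demonstrates this on a made-up ring of 8 sector averages (the example in the text uses 256); the direct O(n^2) DFT and all values are illustrative assumptions, not the inspection system's actual code.

```python
import cmath

def amplitude_spectrum(samples):
    """Magnitudes of the discrete Fourier transform, computed directly."""
    n = len(samples)
    return [abs(sum(samples[k] * cmath.exp(-2j * cmath.pi * m * k / n)
                    for k in range(n)))
            for m in range(n)]

# Hypothetical sector averages along the ring; the dip models an intrusion.
ring = [10.0, 10.0, 9.0, 3.0, 2.0, 9.0, 10.0, 10.0]
rotated = ring[3:] + ring[:3]      # same bottle, rotated by three sectors

spectrum = amplitude_spectrum(ring)
spectrum_rotated = amplitude_spectrum(rotated)
max_difference = max(abs(a - b) for a, b in zip(spectrum, spectrum_rotated))
print(max_difference)
```

The two spectra agree up to rounding error, so a measurement vector built from them does not depend on the orientation of the bottle under the camera.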

The transformed averages form the measurement vector $\mathbf{z}$. The next
step is the construction of a log-likelihood ratio $\Lambda(\mathbf{z})$ according to
(2.40). Comparing the log-likelihood ratio against a suitable threshold
value gives the final decision.

Such a detector should be trained with a set of bottles that are
manually inspected so as to determine the parameters $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, etc. (see
Chapter 5). Another set of manually inspected bottles is used for
evaluation. The result for a particular application is shown in Figure 2.16.
It seems that the Gaussian assumption with equal covariance matrices is
appropriate here. The discriminability appears to be d = 4.8.

Figure 2.16 Estimated performance of the bottle inspector. (a) The conditional
probability densities of the log-likelihood ratio. (b) The ROC curve
Many good textbooks on pattern classification have been written. These
books go into more detail than is possible here and approach the subject

2.5 EXERCISES


1. Give at least two more examples of classification systems. Also define possible
   measurements and the relevant classes. (0)

2. Give a classification problem where the class definitions are subjective. (0)

3. Assume we have three classes of tomato with decreasing quality, class ‘A’, class ‘B’ and
   class ‘C’. Assume further that the cost of misclassifying a tomato to a higher quality is
   twice as expensive as vice versa. Give the cost matrix. What extra information do you
   need in order to fully determine the matrix? (0)

4. Assume that the number of scrap objects in Figure 2.2 is actually twice as large. How
   should the cost matrix, given in Table 2.2, be changed, such that the decision function
   remains the same? (0)

5. What quantities do you need to compute the Bayes classifier? How would you obtain
   these quantities? (0)

6. Derive a decision function assuming that objects come from normally distributed
   classes as in Section 2.1.2, but now with an arbitrary cost function (*).

7. Can you think of a physical measurement system in which it can be expected that the
   class distributions are Gaussian and where the covariance matrices are independent
   of the class? (0)

8. Construct the ROC curve for the case that the classes have no overlap, and for the
   case that the classes overlap completely. (0)

9. Derive how the ROC curve changes when the class prior probabilities are changed. (0)

10. Reconstruct the class conditional probabilities for the case that the ROC curve is not
    symmetric around the axis which runs from (1,0) to (0,1). (0)


3 


Parameter Estimation 


Parameter estimation is the process of attributing a parametric descrip- 
tion to an object, a physical process or an event based on measurements 
that are obtained from that object (or process, or event). The measure- 
ments are made available by a sensory system. Figure 3.1 gives an over- 
view. Parameter estimation and pattern classification are similar 
processes because they both aim to describe an object using measure- 
ments. However, in parameter estimation the description is in terms of a 
real-valued scalar or vector, whereas in classification the description is in 
terms of just one class selected from a finite number of classes. 


Example 3.1 Estimation of the backscattering coefficient from 
SAR images 

In earth observation based on airborne SAR (synthetic aperture radar) 
imaging, the physical parameter of interest is the backscattering coef- 
ficient. This parameter provides information about the condition of 
the surface of the earth, e.g. soil type, moisture content, crop type, 
growth of the crop. 

The mean backscattered energy of a radar signal in a direction is
proportional to this backscattering coefficient. In order to reduce
so-called speckle noise the given direction is probed a number of
times. The results are averaged to yield the final measurement. Figure 3.2
shows a large number of realizations of the true backscattering
coefficient and its corresponding measurement.¹ In this example, the
number of probes per measurement is eight. It can be seen that, even
after averaging, the measurement is still inaccurate. Moreover,
although the true backscattering coefficient is always between 0 and 1,
the measurements can easily violate this constraint (some
measurements are greater than 1).

Figure 3.2 Different realizations of the backscattering coefficient and its
corresponding measurement

The task of a parameter estimator here is to map each measurement 
to an estimate of the corresponding backscattering coefficient. 


This chapter addresses the problem of how to design a parameter 
estimator. For that, two approaches exist: Bayesian estimation (Section 
3.1) and data-fitting techniques (Section 3.3). The Bayesian-theoretic 
framework for parameter estimation follows the same line of reasoning 
as the one for classification (as discussed in Chapter 2). It is a probabil- 
istic approach. The second approach, data fitting, does not have such a 
probabilistic context. The various criteria for the evaluation of an esti- 
mator are discussed in Section 3.2. 


3.1 BAYESIAN ESTIMATION 


Figure 3.3 gives a framework in which parameter estimation can be
defined. The starting point is a probabilistic experiment where the
outcome is a random vector $\mathbf{x}$ defined in $\mathbb{R}^M$, and with probability density
$p(\mathbf{x})$. Associated with $\mathbf{x}$ is a physical object, process or event (in short:
‘physical object’), of which $\mathbf{x}$ is a property. $\mathbf{x}$ is called a parameter
vector, and its density $p(\mathbf{x})$ is called the prior probability density.

Figure 3.3 Parameter estimation

¹ The data shown in Figure 3.2 is the result of a simulation. Therefore, in this case, the true
backscattering coefficients are known. Of course, in practice, the true parameter of interest is
always unknown. Only the measurements are available.

The object is sensed by a sensory system which produces an N-dimensional
measurement vector $\mathbf{z}$. The task of the parameter estimator is to
recover the original parameter vector $\mathbf{x}$ given the measurement vector $\mathbf{z}$.
This is done by means of the estimation function $\hat{\mathbf{x}}(\mathbf{z}): \mathbb{R}^N \to \mathbb{R}^M$. The
conditional probability density $p(\mathbf{z}|\mathbf{x})$ gives the connection between the
parameter vector and measurements. With fixed $\mathbf{x}$, the randomness of
the measurement vector $\mathbf{z}$ is due to physical noise sources in the sensor
system and other unpredictable phenomena. The randomness is
characterized by $p(\mathbf{z}|\mathbf{x})$. The overall probability density of $\mathbf{z}$ is found by averaging
the conditional density over the complete parameter space:

$$ p(\mathbf{z}) = \int_{\mathbf{x}} p(\mathbf{z}|\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x} \tag{3.1} $$


The integral extends over the entire M-dimensional space $\mathbb{R}^M$.
Finally, Bayes’ theorem for conditional probabilities gives us the
posterior probability density $p(\mathbf{x}|\mathbf{z})$:

$$ p(\mathbf{x}|\mathbf{z}) = \frac{p(\mathbf{z}|\mathbf{x})\,p(\mathbf{x})}{p(\mathbf{z})} \tag{3.2} $$

This density is most useful since z, being the output of the sensory 
system, is at our disposal and thus fully known. Thus, p(x|z) represents 
exactly the knowledge that we have on x after having observed z. 


Example 3.2 Estimation of the backscattering coefficient 

The backscattering coefficient x from Example 3.1 is within the 
interval [0,1]. In most applications, however, lower values of the 
coefficient occur more frequently than higher ones. Such a preference 
can be taken into account by means of the prior probability density 
p(x). We will assume that for a certain application x has a beta 
distribution: 


$$ p(x) = \frac{(a+b+1)!}{a!\,b!}\, x^{a}(1-x)^{b} \quad \text{for } 0 \le x \le 1 \tag{3.3} $$





The parameters a and b are the shape parameters of the distribution.
In Figure 3.4(a) these parameters are set to a = 1 and b = 4. These
values will be used throughout the examples in this chapter. Note that
there is no physical evidence for the beta distribution of x. The
assumption is a subjective result of our state of knowledge concerning
the occurrence of x. If no such knowledge is available, a uniform
distribution between 0 and 1 (i.e. all x are equally likely) would be
more reasonable.

Figure 3.4 Probability densities for the backscattering coefficient. (a) Prior density
p(x). (b) Conditional density p(z|x) with $N_{probes} = 8$. The two axes have been scaled
with x and 1/x, respectively, to obtain invariance with respect to x

The measurement is denoted by z. The mathematical model for
SAR measurements is that, with fixed x, the variable $N_{probes}\,z/x$ has
a gamma distribution with parameter $N_{probes}$ (the number of probes
per measurement). The probability density associated with a gamma
distribution is:

$$ \mathrm{gamma\_pdf}(u, \alpha) = \frac{U(u)}{\Gamma(\alpha)}\, u^{\alpha-1} \exp(-u) \tag{3.4} $$

where u is the independent variable, $\Gamma(\alpha)$ is the gamma function, $\alpha$ is
the parameter of the distribution and U(u) is the unit step function
which returns 0 if u is negative and 1 otherwise. Since z can be
regarded as a gamma-distributed random variable scaled by a factor
$x/N_{probes}$, the conditional density of z becomes:

$$ p(z|x) = U(z)\, \frac{N_{probes}}{x}\, \mathrm{gamma\_pdf}\left(\frac{N_{probes}\,z}{x},\, N_{probes}\right) \tag{3.5} $$


Figure 3.4(b) shows the conditional density. 
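The densities of this example are easy to check numerically. The Python sketch below implements the prior (3.3) with a = 1 and b = 4 and the conditional density (3.5) with eight probes, and then uses a simple midpoint rule to verify that both integrate to one and that the mean of z given x equals x. The helper names, the integration limits and the step counts are assumptions made for illustration only.

```python
import math

N_PROBES = 8          # number of probes per measurement
A, B = 1, 4           # shape parameters of the prior (3.3)

def prior(x):
    """Beta prior p(x) of (3.3)."""
    if not 0.0 <= x <= 1.0:
        return 0.0
    norm = math.factorial(A + B + 1) / (math.factorial(A) * math.factorial(B))
    return norm * x**A * (1.0 - x)**B

def gamma_pdf(u, alpha):
    """Gamma density (3.4); the unit step is handled by the early return."""
    if u < 0.0:
        return 0.0
    return u**(alpha - 1) * math.exp(-u) / math.gamma(alpha)

def conditional(z, x):
    """Conditional density p(z|x) of (3.5), for x > 0."""
    if z < 0.0:
        return 0.0
    return (N_PROBES / x) * gamma_pdf(N_PROBES * z / x, N_PROBES)

def integrate(f, lo, hi, n=4000):
    """Midpoint rule on n equal sub-intervals of [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

x_true = 0.3
prior_mass = integrate(prior, 0.0, 1.0)
conditional_mass = integrate(lambda z: conditional(z, x_true), 0.0, 3.0)
conditional_mean = integrate(lambda z: z * conditional(z, x_true), 0.0, 3.0)
print(prior_mass, conditional_mass, conditional_mean)
```

The same `integrate` helper can be applied to the product `conditional(z, x) * prior(x)` over x to approximate p(z) as in (3.1).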


Cost functions 


The optimization criterion of Bayes, minimum risk, applies to statistical 
parameter estimation provided that two conditions are met. First, it 
must be possible to quantify the cost involved when the estimates differ 
from the true parameters. Second, the expectation of the cost, the risk, 
should be acceptable as an optimization criterion. 

Suppose that the damage is quantified by a cost function
$C(\hat{\mathbf{x}}|\mathbf{x}): \mathbb{R}^M \times \mathbb{R}^M \to \mathbb{R}$. Ideally, this function represents the true cost.
In most applications, however, it is difficult to quantify the cost
accurately. Therefore, it is common practice to choose a cost function whose
mathematical treatment is not too complex. Often, the assumption is
that the cost function only depends on the difference between estimated
and true parameters: the estimation error $\mathbf{e} = \hat{\mathbf{x}} - \mathbf{x}$. With this
assumption, the following cost functions are well known (see Table 3.1):


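• quadratic cost function:

$$ C(\hat{\mathbf{x}}|\mathbf{x}) = \|\hat{\mathbf{x}} - \mathbf{x}\|_2^2 = \sum_{m=0}^{M-1} (\hat{x}_m - x_m)^2 \tag{3.6} $$

• absolute value cost function:

$$ C(\hat{\mathbf{x}}|\mathbf{x}) = \|\hat{\mathbf{x}} - \mathbf{x}\|_1 = \sum_{m=0}^{M-1} |\hat{x}_m - x_m| \tag{3.7} $$

• uniform cost function:

$$ C(\hat{\mathbf{x}}|\mathbf{x}) = \begin{cases} 1 & \text{if } \|\hat{\mathbf{x}} - \mathbf{x}\|_1 > \Delta \\ 0 & \text{if } \|\hat{\mathbf{x}} - \mathbf{x}\|_1 \le \Delta \end{cases} \quad \text{with: } \Delta \to 0 \tag{3.8} $$

Table 3.1 Three different Bayes estimators worked out for the scalar case

                 MMSE estimation                     MMAE estimation                                        MAP estimation
Cost function    Quadratic cost function             Absolute cost function                                 Uniform cost function
Estimate         $\hat{x}_{MMSE}(z) = E[x|z]$        $\hat{x}_{MMAE}(z) = \hat{x}$ with                     $\hat{x}_{MAP}(z) = \mathrm{argmax}_x \{p(x|z)\}$
                 $= \int_x x\,p(x|z)\,dx$            $\int_{-\infty}^{\hat{x}} p(x|z)\,dx = \frac{1}{2}$

(In the original table, each column also shows a plot of the cost $C(\hat{x}|x)$ against the error $\hat{x} - x$.)
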


The first two cost functions are instances of the Minkowski distance 
measures (see Appendix A.2). The third cost function is an approxima- 
tion of the distance measure mentioned in (a.22). 


Risk minimization 


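With an arbitrarily selected estimate $\hat{\mathbf{x}}$ and a given measurement vector
$\mathbf{z}$, the conditional risk of $\hat{\mathbf{x}}$ is defined as the expectation of the cost
function:

$$ R(\hat{\mathbf{x}}|\mathbf{z}) = E[C(\hat{\mathbf{x}}|\mathbf{x})|\mathbf{z}] = \int_{\mathbf{x}} C(\hat{\mathbf{x}}|\mathbf{x})\,p(\mathbf{x}|\mathbf{z})\,d\mathbf{x} \tag{3.9} $$

In Bayes estimation (or minimum risk estimation) the estimate is the
parameter vector that minimizes the risk:

$$ \hat{\mathbf{x}}(\mathbf{z}) = \underset{\hat{\mathbf{x}}}{\mathrm{argmin}} \{ R(\hat{\mathbf{x}}|\mathbf{z}) \} \tag{3.10} $$

The minimization extends over the entire parameter space.
The overall risk (also called average risk) of an estimator $\hat{\mathbf{x}}(\mathbf{z})$ is the
expected cost seen over the full set of possible measurements:

$$ R = E[R(\hat{\mathbf{x}}(\mathbf{z})|\mathbf{z})] = \int_{\mathbf{z}} R(\hat{\mathbf{x}}(\mathbf{z})|\mathbf{z})\,p(\mathbf{z})\,d\mathbf{z} \tag{3.11} $$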


Minimization of the integral is accomplished by minimization of the
integrand. However, since $p(\mathbf{z})$ is positive, it suffices to minimize
$R(\hat{\mathbf{x}}(\mathbf{z})|\mathbf{z})$. Therefore, the Bayes estimator not only minimizes the
conditional risk, but also the overall risk.

The Bayes solution is obtained by substituting the selected cost
function in (3.9) and (3.10). Differentiating and equating it to zero yields for
the three cost functions given in (3.6), (3.7) and (3.8):

• MMSE estimation (MMSE = minimum mean square error).
• MMAE estimation (MMAE = minimum mean absolute error).
• MAP estimation (MAP = maximum a posteriori).

Table 3.1 gives the solutions that are obtained if x is a scalar. The
MMSE and the MAP estimators will be worked out for the vectorial
case in the next sections. But first, the scalar case will be illustrated by
means of an example.


Example 3.3 Estimation of the backscattering coefficient

The estimators for the backscattering coefficient (see previous example)
take the form depicted in Figure 3.5. These estimators are found by
substitution of (3.3) and (3.5) in the expressions in Table 3.1.

Figure 3.5 Three different Bayesian estimators

In this example, the three estimators do not differ much. Nevertheless,
their own typical behaviours manifest themselves clearly if we
evaluate their results empirically. This can be done by means of the
population of the $N_{pop} = 500$ realizations that are shown in the figure.
Over the samples $z_i$ we calculate the average cost
$\frac{1}{N_{pop}} \sum_{i=1}^{N_{pop}} C(\hat{x}(z_i)|x_i)$,
where $x_i$ is the true value of the i-th sample and $\hat{x}(\cdot)$ is the estimator under
test. Table 3.2 gives the results of that evaluation for the three different
estimators and the three different cost functions.

Table 3.2 Empirical evaluation of the three different Bayes estimators in Figure 3.5

                                                      Type of estimation
Type of cost function                                 MMSE         MMAE         MAP
                                                      estimation   estimation   estimation
Quadratic cost function                               0.0067       0.0069       0.0081
Absolute cost function                                0.062        0.060        0.063
Uniform cost function (evaluated with Δ = 0.05)       0.26         0.19         0.10

Not surprisingly, in Table 3.2 the MMSE, the MMAE and the 
MAP estimators are optimal with respect to their own criterion, i.e. 
the quadratic, the absolute value and the uniform cost criterion, 
respectively. It appears that the MMSE estimator is preferable if the 
cost of a large error is considerably higher than the one of a small 
error. The MAP estimator does not discriminate between small or 
large errors. The MMAE estimator takes its position in between. 
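The empirical evaluation described above can be imitated with a small simulation. The Python sketch below draws true coefficients from the beta prior, generates measurements according to the gamma model of (3.5), computes the three estimates on a grid of x values, and averages the three cost functions. It is an illustration under assumptions, not the procedure used for Table 3.2: the grid, the random seed and the helper names are made up, the densities are only computed up to constant factors, and the resulting numbers will differ from the table because the realizations differ.

```python
import math
import random

N_PROBES = 8                                  # probes per measurement
A, B = 1, 4                                   # shape parameters of (3.3)
GRID = [0.005 * k for k in range(1, 201)]     # candidate values of x in (0, 1]
DELTA = 0.05                                  # tolerance of the uniform cost

def prior(x):
    """Beta prior (3.3), up to a constant factor."""
    return x**A * (1.0 - x)**B

def likelihood(z, x):
    """Conditional density p(z|x) of (3.5) for z > 0, up to a constant factor."""
    u = N_PROBES * z / x
    return (1.0 / x) * u**(N_PROBES - 1) * math.exp(-u)

def posterior(z):
    """Posterior p(x|z) on GRID, normalised so that the weights sum to one."""
    weights = [likelihood(z, x) * prior(x) for x in GRID]
    total = sum(weights)
    return [w / total for w in weights]

def estimate(z):
    """Return the (MMSE, MMAE, MAP) estimates for a measurement z > 0."""
    post = posterior(z)
    x_mmse = sum(x * p for x, p in zip(GRID, post))      # posterior mean
    x_mmae, cumulative = GRID[-1], 0.0
    for x, p in zip(GRID, post):                         # posterior median
        cumulative += p
        if cumulative > 0.5:
            x_mmae = x
            break
    x_map = max(zip(post, GRID))[1]                      # posterior mode
    return x_mmse, x_mmae, x_map

def simulate(n_pop=500, seed=1):
    """Average cost; rows: quadratic, absolute, uniform; columns: MMSE, MMAE, MAP."""
    rng = random.Random(seed)
    costs = [[0.0] * 3 for _ in range(3)]
    for _ in range(n_pop):
        x = rng.betavariate(A + 1, B + 1)                # true coefficient
        z = x * rng.gammavariate(N_PROBES, 1.0) / N_PROBES
        for j, x_hat in enumerate(estimate(z)):
            e = x_hat - x
            costs[0][j] += e * e / n_pop
            costs[1][j] += abs(e) / n_pop
            costs[2][j] += (1.0 if abs(e) > DELTA else 0.0) / n_pop
    return costs

costs = simulate()
for name, row in zip(("quadratic", "absolute", "uniform"), costs):
    print(name, ["%.4f" % c for c in row])
```

Because the posterior in this example is skewed towards larger x, the mode returned by the MAP estimator lies below the mean returned by the MMSE estimator for a typical measurement.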

MATLAB code for generating Figure 3.5 is given in Listing 3.1. It
uses the Statistics toolbox for calculating the various probability
density functions. Although the MAP solution can be found
analytically, here we approximate all three solutions numerically. To avoid
confusion, it is convenient to create functions that calculate the various
probabilities needed. Note how $p(z) = \int p(z|x)p(x)\,dx$ is
approximated by a sum over a range of values of x, whereas p(x|z) is found
by Bayes’ rule.


Listing 3.1
MATLAB code for MMSE, MMAE and MAP estimation in the scalar
case.

function estimates
global N Np a b xrange;

N=500;                      % Number of samples
Np=8;                       % Number of looks
a=2; b=5;                   % Beta distribution parameters
                            % (betapdf(x,2,5) is proportional to x*(1-x)^4)

x=0.005:0.005:1;            % Interesting range of x
z=0.005:0.005:1.5;          % Interesting range of z
load scatter;               % Load set (for plotting only)
xrange=x;

for i=1:length(z)
   % MAP: location of the maximum of the posterior
   [dummy,ind]=max(px_z(x,z(i))); x_map(i)=x(ind);
   % MMSE: posterior mean, approximated by sums over x
   x_mse(i)=sum(pz_x(z(i),x).*px(x).*x) ...
            ./sum(pz_x(z(i),x).*px(x));
   % MMAE: posterior median, first x at which the cumulative sum exceeds 0.5
   ind=find((cumsum(px_z(x,z(i)))./sum(px_z(x,z(i))))>0.5);
   x_mae(i)=x(ind(1));
end
figure; clf; plot(zset,xset,'.'); hold on;
plot(z,x_map,'k-.'); plot(z,x_mse,'k--');
plot(z,x_mae,'k-');
legend('realizations','MAP','MSE','MAE');
return

function ret=px(x)          % Prior density p(x)
global a b; ret=betapdf(x,a,b);
return

function ret=pz_x(z,x)      % Conditional density p(z|x), see (3.5)
global Np; ret=(z>0).*(Np./x).*gampdf(Np*z./x,Np,1);
return

function ret=pz(z)          % p(z), approximated by a sum over xrange
global xrange; ret=sum(px(xrange).*pz_x(z,xrange));
return

function ret=px_z(x,z)      % Posterior density p(x|z) by Bayes' rule
ret=pz_x(z,x).*px(x)./pz(z);
return


3.1.1 MMSE estimation 


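The solution based on the quadratic cost function (3.6) is called the
minimum mean square error estimator, also called the minimum
variance estimator for reasons that will become clear in a moment.
Substitution of (3.6) and (3.9) in (3.10) gives:

$$ \hat{\mathbf{x}}_{MMSE}(\mathbf{z}) = \underset{\hat{\mathbf{x}}}{\mathrm{argmin}} \left\{ \int_{\mathbf{x}} (\hat{\mathbf{x}} - \mathbf{x})^T (\hat{\mathbf{x}} - \mathbf{x})\,p(\mathbf{x}|\mathbf{z})\,d\mathbf{x} \right\} \tag{3.12} $$

Differentiating the function between braces with respect to $\hat{\mathbf{x}}$ (see
Appendix B.4), and equating this result to zero yields a system of M linear
equations, the solution of which is:

$$ \hat{\mathbf{x}}_{MMSE}(\mathbf{z}) = \int_{\mathbf{x}} \mathbf{x}\,p(\mathbf{x}|\mathbf{z})\,d\mathbf{x} = E[\mathbf{x}|\mathbf{z}] \tag{3.13} $$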
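The conditional risk of this solution is the sum of the variances of the
estimated parameters:

$$ \begin{aligned}
R(\hat{\mathbf{x}}_{MMSE}(\mathbf{z})|\mathbf{z}) &= \int_{\mathbf{x}} (\hat{\mathbf{x}}_{MMSE}(\mathbf{z}) - \mathbf{x})^T (\hat{\mathbf{x}}_{MMSE}(\mathbf{z}) - \mathbf{x})\,p(\mathbf{x}|\mathbf{z})\,d\mathbf{x} \\
&= \int_{\mathbf{x}} (E[\mathbf{x}|\mathbf{z}] - \mathbf{x})^T (E[\mathbf{x}|\mathbf{z}] - \mathbf{x})\,p(\mathbf{x}|\mathbf{z})\,d\mathbf{x} \\
&= \sum_{m=0}^{M-1} \mathrm{Var}[x_m|\mathbf{z}]
\end{aligned} \tag{3.14} $$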


Hence the name ‘minimum variance estimator’. 
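The differentiation step that leads from (3.12) to (3.13) can be written out as follows. This is a sketch of the standard argument, using only that the posterior density integrates to one; it is not taken from Appendix B.4.

```latex
\begin{aligned}
\frac{\partial}{\partial\hat{\mathbf{x}}}
\int_{\mathbf{x}} (\hat{\mathbf{x}}-\mathbf{x})^{T}(\hat{\mathbf{x}}-\mathbf{x})\,
p(\mathbf{x}|\mathbf{z})\,d\mathbf{x}
&= 2\int_{\mathbf{x}} (\hat{\mathbf{x}}-\mathbf{x})\,p(\mathbf{x}|\mathbf{z})\,d\mathbf{x} \\
&= 2\hat{\mathbf{x}}\int_{\mathbf{x}} p(\mathbf{x}|\mathbf{z})\,d\mathbf{x}
 - 2\int_{\mathbf{x}} \mathbf{x}\,p(\mathbf{x}|\mathbf{z})\,d\mathbf{x} \\
&= 2\hat{\mathbf{x}} - 2E[\mathbf{x}|\mathbf{z}] = \mathbf{0}
\quad\Longrightarrow\quad
\hat{\mathbf{x}} = E[\mathbf{x}|\mathbf{z}]
\end{aligned}
```

Substituting this solution back, the integrand becomes $\sum_{m}(x_m - E[x_m|\mathbf{z}])^2$, and integrating each term against $p(\mathbf{x}|\mathbf{z})$ gives the conditional variances that are summed in (3.14).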


3.1.2 MAP estimation 


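If the uniform cost function is chosen, the conditional risk (3.9)
becomes:

$$ R(\hat{\mathbf{x}}|\mathbf{z}) = \int_{\mathbf{x}} p(\mathbf{x}|\mathbf{z})\,d\mathbf{x} - p(\hat{\mathbf{x}}|\mathbf{z}) \tag{3.15} $$

The first term equals one and does not depend on $\hat{\mathbf{x}}$, so minimizing the
risk amounts to maximizing $p(\hat{\mathbf{x}}|\mathbf{z})$. The estimate which now minimizes
the risk is called the maximum a posteriori (MAP) estimate:

$$ \hat{\mathbf{x}}_{MAP}(\mathbf{z}) = \underset{\mathbf{x}}{\mathrm{argmax}} \{ p(\mathbf{x}|\mathbf{z}) \} \tag{3.16} $$

This solution equals the mode (maximum) of the posterior probability. It
can also be written entirely in terms of the prior probability densities and
the conditional probabilities:

$$ \hat{\mathbf{x}}_{MAP}(\mathbf{z}) = \underset{\mathbf{x}}{\mathrm{argmax}} \left\{ \frac{p(\mathbf{z}|\mathbf{x})\,p(\mathbf{x})}{p(\mathbf{z})} \right\} = \underset{\mathbf{x}}{\mathrm{argmax}} \{ p(\mathbf{z}|\mathbf{x})\,p(\mathbf{x}) \} \tag{3.17} $$

This expression is similar to the one of MAP classification; see (2.12).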
3.1.3 The Gaussian case with linear sensors


Suppose that the parameter vector $\mathbf{x}$ has a normal distribution with
expectation vector $E[\mathbf{x}] = \boldsymbol{\mu}_x$ and covariance matrix $\mathbf{C}_x$. In addition,
suppose that the measurement vector can be expressed as a linear
combination of the parameter vector corrupted by additive Gaussian
noise:

$$ \mathbf{z} = \mathbf{H}\mathbf{x} + \mathbf{v} \tag{3.18} $$

where $\mathbf{v}$ is an N-dimensional random vector with zero expectation and
covariance matrix $\mathbf{C}_v$. $\mathbf{x}$ and $\mathbf{v}$ are uncorrelated. $\mathbf{H}$ is an N x M matrix.

The assumption that both $\mathbf{x}$ and $\mathbf{v}$ are normal implies that the
conditional probability density of $\mathbf{z}$ is also normal. The conditional
expectation vector of $\mathbf{z}$ equals:

$$ E[\mathbf{z}|\mathbf{x}] = \mathbf{H}\mathbf{x}, \qquad E[\mathbf{z}] = \boldsymbol{\mu}_z = \mathbf{H}\boldsymbol{\mu}_x \tag{3.19} $$

The conditional covariance matrix of $\mathbf{z}$ is $\mathbf{C}_{z|x} = \mathbf{C}_v$.

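Under these assumptions the posterior distribution of $\mathbf{x}$ is normal as
well. Application of (3.2) yields the following expressions for the MMSE
estimate and the corresponding covariance matrix:

$$ \begin{aligned}
\hat{\mathbf{x}}_{MMSE}(\mathbf{z}) = \boldsymbol{\mu}_{x|z} = E[\mathbf{x}|\mathbf{z}] &= \left(\mathbf{H}^T\mathbf{C}_v^{-1}\mathbf{H} + \mathbf{C}_x^{-1}\right)^{-1}\left(\mathbf{H}^T\mathbf{C}_v^{-1}\mathbf{z} + \mathbf{C}_x^{-1}\boldsymbol{\mu}_x\right) \\
\mathbf{C}_{x|z} &= \left(\mathbf{H}^T\mathbf{C}_v^{-1}\mathbf{H} + \mathbf{C}_x^{-1}\right)^{-1}
\end{aligned} \tag{3.20} $$

The proof is left as an exercise for the reader. See exercise 3. Note that
$\boldsymbol{\mu}_{x|z}$, being the posterior expectation, is the MMSE estimate $\hat{\mathbf{x}}_{MMSE}(\mathbf{z})$.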

The posterior expectation, $E[\mathbf{x}|\mathbf{z}]$, consists of two terms. The first term
is linear in $\mathbf{z}$. It represents the information coming from the
measurement. The second term is linear in $\boldsymbol{\mu}_x$, representing the prior knowledge.
To show that this interpretation is correct it is instructive to see what
happens at the extreme ends: either no information from the
measurement, or no prior knowledge:

• The measurements are useless if the matrix $\mathbf{H}$ is virtually zero, or if
  the noise is too large, i.e. $\mathbf{C}_v^{-1}$ is too small. In both cases, the second
  term in (3.20) dominates. In the limit, the estimate becomes
  $\hat{\mathbf{x}}_{MMSE}(\mathbf{z}) = \boldsymbol{\mu}_x$ with covariance matrix $\mathbf{C}_x$, i.e. the estimate is purely
  based on prior knowledge.
• On the other hand, if the prior knowledge is weak, i.e. if the
  variances of the parameters are very large, the inverse covariance
  matrix $\mathbf{C}_x^{-1}$ tends to zero. In the limit, the estimate becomes:

$$ \hat{\mathbf{x}}_{MMSE}(\mathbf{z}) = \left(\mathbf{H}^T\mathbf{C}_v^{-1}\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{C}_v^{-1}\mathbf{z} \tag{3.21} $$

  In this solution, the prior knowledge, i.e. $\boldsymbol{\mu}_x$, is completely ruled out.

Note that the mode of a normal distribution coincides with the
expectation. Therefore, in the linear-Gaussian case, MAP estimation and
MMSE estimation coincide: $\hat{\mathbf{x}}_{MMSE}(\mathbf{z}) = \hat{\mathbf{x}}_{MAP}(\mathbf{z})$.
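The two limiting cases are easy to reproduce in the scalar case (N = M = 1), where (3.20) reduces to a ratio of precision-weighted terms. The Python sketch below is a minimal illustration; the function name and all numerical values are hypothetical.

```python
def posterior_scalar(z, h, var_v, mu_x, var_x):
    """Scalar version of (3.20): posterior mean and variance of x given z."""
    precision = h * h / var_v + 1.0 / var_x      # H^T Cv^-1 H + Cx^-1
    mean = (h * z / var_v + mu_x / var_x) / precision
    return mean, 1.0 / precision

# Balanced case: measurement and prior carry equal weight.
print(posterior_scalar(z=2.0, h=1.0, var_v=1.0, mu_x=0.0, var_x=1.0))

# Very noisy measurement: the estimate falls back on the prior mean.
print(posterior_scalar(z=2.0, h=1.0, var_v=1e12, mu_x=0.5, var_x=1.0))

# Very weak prior: the estimate approaches z/h, the scalar form of (3.21).
print(posterior_scalar(z=2.0, h=4.0, var_v=1.0, mu_x=0.5, var_x=1e12))
```

In the balanced case the posterior mean lies halfway between the prior mean and the measurement, and the posterior variance is half of either individual variance.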


3.1.4 Maximum likelihood estimation 


In many practical situations the prior knowledge needed in MAP estima- 
tion is not available. In these cases, an estimator which does not depend 
on prior knowledge is desirable. One attempt in that direction is the 
method referred to as maximum likelihood estimation (ML estimation). 
The method is based on the observation that in MAP estimation, (3.17), 
the peak of the first factor p(z|x) is often in an area of x in which the 
second factor p(x) is almost constant. This holds true especially if 
little prior knowledge is available. In these cases, the prior density p(x) 
does not affect the position of the maximum very much. Discarding 
the factor, and maximizing the function p(z|x) solely, gives the ML 
estimate: 


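$$ \hat{\mathbf{x}}_{ML}(\mathbf{z}) = \underset{\mathbf{x}}{\mathrm{argmax}} \{ p(\mathbf{z}|\mathbf{x}) \} \tag{3.22} $$

Regarded as a function of $\mathbf{x}$ the conditional probability density is called the
likelihood function. Hence the name ‘maximum likelihood estimation’.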

Another motivation for the ML estimator is when we change our 
viewpoint with respect to the nature of the parameter vector x. In the 
Bayesian approach x is a random vector, statistically defined by means 
of probability densities. In contrast, we may also regard x as a non- 
random vector whose value is simply unknown. This is the so-called 
Fisher approach. In this view, there are no probability densities asso- 
ciated with x. The only density in which x appears is p(z|x), but here x 
must be regarded as a parameter of the density of z. Of all the estimators
discussed so far, the only estimator that can handle this deterministic
point of view on x is the ML estimator.
Example 3.4 Maximum likelihood estimation of the backscattering
coefficient

The maximum likelihood estimator for the backscattering coefficient
(see previous examples) is found by maximizing (3.5):

$$ \frac{d\,p(z|x)}{dx} = 0 \quad \Rightarrow \quad \hat{x}_{ML}(z) = z \tag{3.23} $$


The estimator is depicted in Figure 3.6 together with the MAP esti- 
mator. The figure confirms the statement above that in areas of flat 
prior probability density the MAP estimator and the ML estimator 
coincide. However, the figure also reveals that the ML estimator can 
produce an estimate of the backscattering coefficient that is larger 
than one; a physical impossibility. This is the price that we have to 
pay for not using prior information about the physical process. 
Figure 3.6 MAP estimation, ML estimation and linear MMSE estimation

If the measurement vector $\mathbf{z}$ is linear in $\mathbf{x}$ and corrupted by additive
Gaussian noise, as given in equation (3.18), the likelihood of $\mathbf{x}$ is
maximized by the estimate given in (3.21). Thus, in that case:

$$ \hat{\mathbf{x}}_{ML}(\mathbf{z}) = \left(\mathbf{H}^T\mathbf{C}_v^{-1}\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{C}_v^{-1}\mathbf{z} \tag{3.24} $$

A further simplification is obtained if we assume that the noise is white,
i.e. $\mathbf{C}_v \propto \mathbf{I}$:

$$ \hat{\mathbf{x}}_{ML}(\mathbf{z}) = \left(\mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{z} \tag{3.25} $$

The operation $(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T$ is the pseudo inverse of $\mathbf{H}$. Of course, its
validity depends on the existence of the inverse of $\mathbf{H}^T\mathbf{H}$. Usually, such is
the case if the number of measurements exceeds the number of
parameters, i.e. N > M. That is, if the system is overdetermined.
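A familiar instance of (3.25) is fitting a straight line: with M = 2 parameters (offset and slope) and N measurements taken at known positions $t_i$, each row of $\mathbf{H}$ is $(1, t_i)$. The Python sketch below solves the resulting 2x2 normal equations in closed form. The data and the function name are made up for illustration, and the explicit check on the determinant corresponds to the requirement that $\mathbf{H}^T\mathbf{H}$ be invertible.

```python
def least_squares_line(t, z):
    """Solve (H^T H) x = H^T z for H with rows (1, t_i); returns (offset, slope)."""
    n = len(t)
    s_t = sum(t)
    s_tt = sum(ti * ti for ti in t)
    s_z = sum(z)
    s_tz = sum(ti * zi for ti, zi in zip(t, z))
    det = n * s_tt - s_t * s_t          # determinant of H^T H
    if det == 0:
        raise ValueError("H^T H is singular: need at least two distinct t values")
    offset = (s_tt * s_z - s_t * s_tz) / det
    slope = (n * s_tz - s_t * s_z) / det
    return offset, slope

# Hypothetical noiseless measurements of the line z = 1 + 2 t.
t = [0.0, 1.0, 2.0, 3.0, 4.0]
z = [1.0 + 2.0 * ti for ti in t]
offset, slope = least_squares_line(t, z)
print(offset, slope)
```

With a single measurement position repeated N times the determinant vanishes, which is the scalar counterpart of an underdetermined system.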


3.1.5 Unbiased linear MMSE estimation 


The estimators discussed in the previous sections exploit full statistical 
knowledge of the problem. Designing such an estimator is often difficult. 
The first problem that arises is the adequate modelling of the probability 
densities. Such modelling requires detailed knowledge of the physical 
process and the sensors. Once we have the probability densities, the 
second problem is how to minimize the conditional risk. Analytic 
solutions are often hard to obtain. Numerical solutions are often 
burdensome. 

If we constrain the expression for the estimator to some mathematical 
form, the problem of designing the estimator boils down to finding the 
suitable parameters of that form. An example is the unbiased linear 
MMSE estimator with the following form:²

$$\hat{x}_{ulMMSE}(z) = Kz + a \qquad (3.26)$$


The matrix K and the vector a must be optimized during the design
phase so as to match the behaviour of the estimator to the problem at
hand. The estimator has the same optimization criterion as the MMSE
estimator, i.e. a quadratic cost function. The constraint results in an
estimator that is not as good as the (unconstrained) MMSE estimator,
but it requires only knowledge of moments up to order two, i.e.
expectation vectors and covariance matrices.

² The connotation of the term ‘unbiased’ becomes clear in Section 3.2.1. The linear MMSE
estimator (without the adjective ‘unbiased’) also exists. It has the form $\hat{x}_{lMMSE}(z) = Kz$. See exercise 1.


The starting point is the overall risk expressed in (3.11). Together with
the quadratic cost function we have:

$$R = \int\!\!\int (Kz + a - x)^T (Kz + a - x)\, p(x|z)\, p(z)\, dx\, dz \qquad (3.27)$$

The optimal unbiased linear MMSE estimator is found by minimizing
R with respect to K and a. Hence we differentiate R with respect to a and
equate the result to zero:

$$\frac{dR}{da} = \int\!\!\int (2a + 2Kz - 2x)\, p(x|z)\, p(z)\, dx\, dz = 2a + 2K\mu_z - 2\mu_x = 0$$

yielding:

$$a = \mu_x - K\mu_z \qquad (3.28)$$

with $\mu_x$ and $\mu_z$ the expectations of x and z.


Substitution of a back into (3.27), differentiation with respect to K,
and equating the result to zero (see also Appendix B.4):

$$R = \int\!\!\int \big(K(z - \mu_z) - (x - \mu_x)\big)^T \big(K(z - \mu_z) - (x - \mu_x)\big)\, p(x|z)\, p(z)\, dx\, dz = \mathrm{trace}(K C_z K^T + C_x - 2 K C_{zx})$$

$$\frac{dR}{dK} = 2 K C_z - 2 C_{xz} = 0$$

yields:

$$K = C_{xz} C_z^{-1} \qquad (3.29)$$

$C_z$ is the covariance matrix of z, and $C_{xz} = E[(x - \mu_x)(z - \mu_z)^T]$ the
cross-covariance matrix between x and z.
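When the moments in (3.28) and (3.29) are not known analytically, they can be replaced by sample moments. The sketch below illustrates this for a vector-valued case. The data, the dimensions and the nonlinear sensor are all assumptions made up for the illustration.

```matlab
% Sketch: designing the unbiased linear MMSE estimator from sample moments.
% The data below are synthetic and purely illustrative.
Ns = 5000;                                    % number of (x,z) training pairs
X = [2 + randn(Ns,1), -1 + 0.5*randn(Ns,1)];  % rows: realizations of a 2-D parameter x
Z = [X(:,1) + X(:,2), X(:,1).^2, X(:,2)];     % rows: output of a hypothetical nonlinear sensor
Z = Z + 0.3*randn(Ns,3);                      % additive sensor noise

mu_x = mean(X,1)';  mu_z = mean(Z,1)';        % expectation vectors (columns)
Xc = X - repmat(mu_x', Ns, 1);                % centred samples of x
Zc = Z - repmat(mu_z', Ns, 1);                % centred samples of z
Cz  = (Zc'*Zc) / (Ns-1);                      % covariance matrix of z
Cxz = (Xc'*Zc) / (Ns-1);                      % cross-covariance matrix of x and z

K = Cxz / Cz;                                 % equation (3.29): K = Cxz*inv(Cz)
a = mu_x - K*mu_z;                            % equation (3.28)

X_hat = (K*Z' + repmat(a, 1, Ns))';           % estimates, one row per sample
bias  = mean(X_hat - X, 1);                   % zero (up to rounding) on the training set
```

The average error on the training set vanishes by construction, because a is chosen such that the mean of the estimates equals the sample mean of x.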
Example 3.5 Unbiased linear MMSE estimation of the
backscattering coefficient
In the scalar case, the linear MMSE estimator takes the form:

$$\hat{x}_{ulMMSE}(z) = \frac{\mathrm{cov}[x, z]}{\mathrm{Var}[z]}\,(z - E[z]) + E[x] \qquad (3.30)$$
where cov[x,z] is the covariance of x and z. In the backscattering 
problem, the required moments are difficult to obtain analytically. 
However, they are easily estimated from the population of the 500 
realizations shown in Figure 3.6 using techniques from Chapter 5. 
The resulting estimator is shown in Figure 3.6. MATLAB code to plot 
the ML and unbiased linear MMSE estimates of the backscattering 
coefficient on a data set is given in Listing 3.2. 


Listing 3.2
MATLAB code for unbiased linear MMSE estimation.

load scatter;                          % Load dataset (zset,xset), assumed to be column vectors
z = 0.005:0.005:1.5;                   % Interesting range of z
x_ml = z;                              % Maximum likelihood, equation (3.23)
mu_x = mean(xset); mu_z = mean(zset);  % Sample means of x and z
% Gain: sample covariance of x and z divided by the sample variance of z
K = ((xset-mu_x)'*(zset-mu_z)) / ((zset-mu_z)'*(zset-mu_z));
a = mu_x - K*mu_z;                     % Offset, equation (3.28)
x_ulmse = K*z + a;                     % Unbiased linear MMSE
figure; clf; plot(zset,xset,'.'); hold on;
plot(z,x_ml,'k-'); plot(z,x_ulmse,'k--');


Linear sensors 


The linear MMSE estimator takes a particular form if the sensory system 
is linear and the sensor noise is additive: 


z= Hx+v (3.31) 


This case is of special interest because of its crucial role in the Kalman filter
(to be discussed in Chapter 4). Suppose that the noise has zero mean with
covariance matrix $C_v$. In addition, suppose that x and v are uncorrelated,
i.e. $C_{xv} = 0$. Under these assumptions the moments of z are as follows:

$$\begin{aligned} \mu_z &= H\mu_x \\ C_z &= H C_x H^T + C_v \\ C_{xz} &= C_x H^T \end{aligned} \qquad (3.32)$$

The proof is left as an exercise for the reader.
Substitution of (3.32), (3.28) and (3.29) in (3.26) gives rise to the
following estimator:

$$\hat{x}_{ulMMSE}(z) = \mu_x + K(z - H\mu_x) \quad \text{with} \quad K = C_x H^T (H C_x H^T + C_v)^{-1} \qquad (3.33)$$


This version of the unbiased linear MMSE estimator is the so-called 
Kalman form of the estimator. 

Examination of (3.20) reveals that the MMSE estimator in the Gaussian
case with linear sensors is also expressed as a linear combination of $\mu_x$ and
z. Thus, in this special case (that is, Gaussian densities + linear sensors)
$\hat{x}_{MMSE}(z)$ is a linear estimator. Since $\hat{x}_{MMSE}(z)$ and $\hat{x}_{ulMMSE}(z)$ are based on
the same optimization criterion, the two solutions must be identical here:
$\hat{x}_{MMSE}(z) = \hat{x}_{ulMMSE}(z)$. We conclude that (3.20) is an alternative form of
(3.33). The forms are mathematically equivalent. See exercise 5.

The interpretation of $\hat{x}_{ulMMSE}(z)$ is as follows. The term $\mu_x$ represents
the prior knowledge. The term $H\mu_x$ is the prior knowledge that we have
about the measurements. Therefore, the factor $z - H\mu_x$ is the informative
part of the measurements (called the innovation). The so-called
Kalman gain matrix K transforms the innovation into a correction term
$K(z - H\mu_x)$ that represents the knowledge that we have gained from the
measurements.
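The Kalman form (3.33) and its interpretation in terms of the innovation can be illustrated with a few lines of MATLAB. This is only a sketch: the prior moments, the sensor matrix, the noise covariance and the measurement vector are hypothetical numbers. The last two lines evaluate the alternative form quoted in exercise 5 for comparison.

```matlab
% Sketch of the Kalman form (3.33); all numbers are hypothetical
mu_x = [1; 0];                    % prior expectation of x
Cx = [4 1; 1 2];                  % prior covariance matrix of x
H  = [1 0; 1 1; 0 2];             % linear sensor, z = H*x + v
Cv = 0.5*eye(3);                  % covariance matrix of the noise v
z  = [1.8; 1.1; -1.2];            % an observed measurement vector

S = H*Cx*H' + Cv;                 % covariance matrix of z, see (3.32)
K = (Cx*H') / S;                  % Kalman gain matrix
innovation = z - H*mu_x;          % informative part of the measurements
x_hat = mu_x + K*innovation;      % prior knowledge plus correction term

% Alternative form (see exercise 5), evaluated for comparison
Ce = inv(inv(Cx) + H'*(Cv\H));
x_alt = Ce*(Cx\mu_x + H'*(Cv\z));
```

A numerical agreement between x_hat and x_alt is of course not a proof of the equivalence; that remains the subject of exercise 5.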


3.2 PERFORMANCE OF ESTIMATORS 


No matter which precautions are taken, there will always be a difference 
between the estimate of a parameter and its true (but unknown) value. 
The difference is the estimation error. An estimate is useless without an 
indication of the magnitude of that error. Usually, such an indication is 
quantified in terms of the so-called bias of the estimator, and the vari- 
ance. The main purpose of this section is to introduce these concepts. 

Suppose that the true, but unknown value of a parameter is x. An
estimator $\hat{x}(.)$ provides us with an estimate $\hat{x} = \hat{x}(z)$ based on
measurements z. The estimation error e is the difference between the estimate
and the true value:

$$e = \hat{x} - x \qquad (3.34)$$


Since x is unknown, e is unknown as well. 
3.2.1 Bias and covariance


The error e is composed of two parts. One part is the one that does not 
change value if we repeat the experiment over and over again. It is the 
expectation of the error, called the bias. The other part is the random 
part and is due to sensor noise and other random phenomena in the 
sensory system. Hence, we have: 


error = bias + random part 


If x is a scalar, the variance of an estimator is the variance of e. As such 
the variance quantifies the magnitude of the random part. If x is a vector, 
each element of e has its own variance. These variances are captured in 
the covariance matrix of e, which provides an economical and also a 
more complete way to represent the magnitude of the random error. 

The application of the expectation and variance operators to e needs 
some discussion. Two cases must be distinguished. If x is regarded as a 
non-random, unknown parameter, then x is not associated with any 
probability density. The only randomness that enters the equations is 
due to the measurements z with density p(z|x). However, if x is regarded 
as random, it does have a probability density. We have two sources of 
randomness then, x and z. 

We start with the first case which applies to, for instance, the maximum
likelihood estimator. Here, the bias b(x) is given by:

$$b(x) \stackrel{def}{=} E[\hat{x} - x \mid x] = \int (\hat{x}(z) - x)\, p(z|x)\, dz \qquad (3.35)$$

The integral extends over the full space of z. In general, the bias depends
on x. The bias of an estimator can be small or even zero in one area of x,
whereas in another area the bias of that same estimator might be large.

In the second case, both x and z are random. Therefore, we define an
overall bias b by taking the expectation operator now with respect to
both x and z:

$$b \stackrel{def}{=} E[\hat{x} - x] = \int\!\!\int (\hat{x}(z) - x)\, p(x, z)\, dz\, dx \qquad (3.36)$$

The integrals extend over the full space of x and z.
The overall bias must be considered as an average taken over the full
range of x. To see this, rewrite p(x,z) = p(z|x)p(x) to yield:

$$b = \int b(x)\, p(x)\, dx \qquad (3.37)$$

where b(x) is given in (3.35).

If the overall bias of an estimator is zero, then the estimator is said to 
be unbiased. Suppose that in two different areas of x the biases of an 
estimator have opposite sign, then these two opposite biases may cancel 
out. We conclude that, even if an estimator is unbiased (i.e. its overall 
bias is zero), then this does not imply that the bias for a specific value of 
x is zero. Estimators that are unbiased for every x are called absolutely 
unbiased. 

The variance of the error, which serves to quantify the random
fluctuations, follows the same line of reasoning as the one of the bias. First
we determine the covariance matrix of the error with non-random x:

$$C_e(x) \stackrel{def}{=} E\big[(e - E[e])(e - E[e])^T \mid x\big] = \int (\hat{x}(z) - x - b(x))(\hat{x}(z) - x - b(x))^T\, p(z|x)\, dz \qquad (3.38)$$

As before, the integral extends over the full space of z. The variances of
the elements of e are at the diagonal of $C_e(x)$.

The magnitude of the full error (bias + random part) is quantified by
the so-called mean square error (the second order moment matrix of the
error):

$$M_e(x) \stackrel{def}{=} E[e e^T \mid x] = \int (\hat{x}(z) - x)(\hat{x}(z) - x)^T\, p(z|x)\, dz \qquad (3.39)$$

It is straightforward to prove that:

$$M_e(x) = b(x) b^T(x) + C_e(x) \qquad (3.40)$$


This expression underlines the fact that the error is composed of a bias 
and a random part. 
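The decomposition (3.40) can be made tangible with a small Monte Carlo experiment in which the same measurement is repeated many times for one fixed x. The following sketch assumes a scalar parameter, Gaussian noise and a deliberately biased estimator; none of these choices stem from the examples in this chapter.

```matlab
% Sketch: empirical bias, variance and mean square error for a fixed x
x = 2;                               % hypothetical true (non-random) parameter
sigma_v = 0.5; N = 10;               % noise level and number of measurements per experiment
R = 20000;                           % number of repeated experiments
Zs = x + sigma_v*randn(N,R);         % each column is one measurement vector z

x_hat = 0.8*mean(Zs,1);              % a deliberately biased (shrinking) estimator
e = x_hat - x;                       % estimation errors, equation (3.34)

b   = mean(e);                       % sample version of the bias b(x), cf. (3.35)
C   = mean((e - b).^2);              % sample version of the variance C_e(x), cf. (3.38)
Mse = mean(e.^2);                    % sample version of M_e(x), cf. (3.39)
```

For this estimator the bias is 0.8x − x = −0.4, and the sample moments satisfy Mse = b² + C up to rounding, which is the scalar version of (3.40).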

The overall mean square error $M_e$ is found by averaging $M_e(x)$ over all
possible values of x:

$$M_e \stackrel{def}{=} E[e e^T] = \int M_e(x)\, p(x)\, dx \qquad (3.41)$$

Finally, the overall covariance matrix of the estimation error is found as:

$$C_e = E\big[(e - b)(e - b)^T\big] = M_e - b b^T \qquad (3.42)$$

The diagonal elements of this matrix are the overall variances of the
estimation errors.

The MMSE estimator and the unbiased linear MMSE estimator are
always unbiased. To see this, rewrite (3.36) as follows:

$$b = \int\!\!\int (\hat{x}_{MMSE}(z) - x)\, p(x|z)\, p(z)\, dx\, dz = \int \left[ \int (E[x|z] - x)\, p(x|z)\, dx \right] p(z)\, dz \qquad (3.43)$$

The inner integral is identical to zero, and thus b must be zero. The proof
of the unbiasedness of the unbiased linear MMSE estimator is left as an
exercise.

Other properties related to the quality of an estimator are stability and
robustness. In this context, stability refers to the property of being
insensitive to small random errors in the measurement vector. Robustness
is the property of being insensitive to large errors in a few elements
of the measurements (outliers); see Section 3.3.2. Often, the enlargement
of prior knowledge increases both the stability and the robustness.
Example 3.6 Bias and variance in the backscattering application
Figure 3.7 shows the bias and variance of the various estimators
discussed in the previous examples. To enable a fair comparison
between bias and variance in comparable units, the square root of
the latter, i.e. the standard deviation, has been plotted. Numerical
evaluation of (3.37), (3.41) and (3.42) yields:³

$$\begin{array}{ll} b_{MMSE} = 0 & \sigma_{MMSE} = \sqrt{C_{MMSE}} = 0.086 \\ b_{ulMMSE} = 0 & \sigma_{ulMMSE} = \sqrt{C_{ulMMSE}} = 0.094 \\ b_{ML} = 0 & \sigma_{ML} = \sqrt{C_{ML}} = 0.116 \\ b_{MAP} \neq 0 & \sigma_{MAP} = \sqrt{C_{MAP}} = 0.087 \end{array}$$

³ In this example, the vector b(x) and the matrix C(x) turn into scalars because here x is a scalar.

Figure 3.7 The bias and the variance of the various estimators in the backscattering
problem


From this, and from Figure 3.7, we observe that: 


• The overall bias of the ML estimator appears to be zero. So, in this
example, the ML estimator is unbiased (together with the two
MMSE estimators which are intrinsically unbiased). The MAP
estimator is biased.

• Figure 3.7 shows that for some ranges of x the bias of the MMSE
estimator is larger than its standard deviation. Nevertheless, the
MMSE estimator outperforms all other estimators with respect to
overall bias and variance. Hence, although a small bias is a desirable
property, sometimes the overall performance of an estimator
can be improved by allowing a larger bias.

• The ML estimator appears to be linear here. As such, it is comparable
with the unbiased linear MMSE estimator. Of these two linear
estimators, the unbiased linear MMSE estimator outperforms the ML
estimator. The reason is that, unlike the ML estimator, the ulMMSE
estimator exploits prior knowledge about the parameter. In addition,
the ulMMSE estimator is better matched to the evaluation criterion.

• Of the two nonlinear estimators, the MMSE estimator outperforms
the MAP estimator. The obvious reason is that the cost function of
the MMSE estimator matches the evaluation criterion.

• Of the two MMSE estimators, the nonlinear MMSE estimator
outperforms the linear one. Both estimators have the same optimization
criterion, but the constraint of the ulMMSE estimator degrades its performance.


3.2.2 The error covariance of the unbiased linear MMSE 
estimator 


We now return to the case of having linear sensors, z = Hx + v, as
discussed in Section 3.1.5. The unbiased linear MMSE estimator
appeared to be (see eq. (3.33)):

$$\hat{x}_{ulMMSE}(z) = \mu_x + K(z - H\mu_x) \quad \text{with} \quad K = C_x H^T (H C_x H^T + C_v)^{-1}$$

where $C_v$ and $C_x$ are the covariance matrices of v and x. $\mu_x$ is the (prior)
expectation vector of x. As said before, $\hat{x}_{ulMMSE}(.)$ is unbiased.

Due to the unbiasedness of $\hat{x}_{ulMMSE}(.)$, the mean of the estimation
error $e = \hat{x}_{ulMMSE}(.) - x$ is zero. The error covariance matrix, $C_e$, of e
expresses the uncertainty that remains after having processed the
measurements. Therefore, $C_e$ is identical to the covariance matrix associated
with the posterior probability density. It is given by (3.20):

$$C_e = (C_x^{-1} + H^T C_v^{-1} H)^{-1} \qquad (3.44)$$

The inverse of a covariance matrix is called an information matrix. For
instance, $C_e^{-1}$ is a measure of the information provided by the estimate
$\hat{x}_{ulMMSE}(.)$. If the norm of $C_e^{-1}$ is large, then the norm of $C_e$ must be small
implying that the uncertainty in $\hat{x}_{ulMMSE}(.)$ is small as well. Equation
(3.44) shows that $C_e^{-1}$ is made up of two terms. The term $C_x^{-1}$ represents
the prior information provided by $\mu_x$. The matrix $C_v^{-1}$ represents the
information that is given by z about the vector Hx. Therefore, the matrix
$H^T C_v^{-1} H$ represents the information about x provided by z. The two
sources of information add up. So, the information about x provided by
$\hat{x}_{ulMMSE}(.)$ is $C_e^{-1} = C_x^{-1} + H^T C_v^{-1} H$.
Using the matrix inversion lemma (b.10) the expression for the error
covariance matrix can be given in an alternative form:

$$C_e = C_x - K S K^T \quad \text{with:} \quad S = H C_x H^T + C_v \qquad (3.45)$$

The matrix S is called the innovation matrix because it is the covariance
matrix of the innovations $z - H\mu_x$. The factor $K(z - H\mu_x)$ is a correction
term for the prior expectation vector $\mu_x$. Equation (3.45) shows that the
prior covariance matrix $C_x$ is reduced by the covariance matrix $K S K^T$ of
the correction term.
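The two expressions for the error covariance matrix, (3.44) and (3.45), can be compared numerically. The matrices in the sketch below are hypothetical; they only serve to show how the information form and the Kalman form are evaluated.

```matlab
% Sketch: two equivalent forms of the error covariance matrix (hypothetical numbers)
Cx = [4 1; 1 2];                         % prior covariance matrix of x
H  = [1 0; 1 1; 0 2];                    % linear sensor matrix
Cv = 0.5*eye(3);                         % covariance matrix of the sensor noise

% Information form (3.44): prior information plus information from z
Ce_info = inv(inv(Cx) + H'*(Cv\H));

% Kalman form (3.45): prior covariance minus covariance of the correction term
S = H*Cx*H' + Cv;                        % innovation matrix
K = (Cx*H') / S;                         % Kalman gain matrix
Ce_kalman = Cx - K*S*K';

reduction = trace(Cx) - trace(Ce_kalman);  % total variance removed by the measurements
```

The information form requires the inversion of M × M matrices, whereas the Kalman form inverts the N × N matrix S. Which one is cheaper depends on the dimensions of x and z.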


3.3 DATA FITTING 


In data-fitting techniques, the measurement process is modelled as:

$$z = h(x) + v \qquad (3.46)$$

where h(.) is the measurement function that models the sensory system,
and v are disturbing factors, such as sensor noise and modelling errors.
The purpose of fitting is to find the parameter vector x which ‘best’ fits
the measurements z.

Suppose that $\hat{x}$ is an estimate of x. Such an estimate is able to ‘predict’
the modelled part of z, but it cannot predict the disturbing factors. Note
that v represents both the noise and the unknown modelling errors. The
prediction of the estimate $\hat{x}$ is given by $h(\hat{x})$. The residuals $\varepsilon$ are defined
as the difference between observed and predicted measurements:

$$\varepsilon = z - h(\hat{x}) \qquad (3.47)$$

Data fitting is the process of finding the estimate $\hat{x}$ that minimizes some
error norm $\|\varepsilon\|$ of the residuals. Different error norms (see Appendix
A.1.1) lead to different data fits. We will briefly discuss two error norms.


3.3.1 Least squares fitting 


The most common error norm is the squared Euclidean norm, also called
the sum of squared differences (SSD), or simply the LS norm (least
squared error norm):

$$\|\varepsilon\|_2^2 = \sum_{n=0}^{N-1} \varepsilon_n^2 = \sum_{n=0}^{N-1} (z_n - h_n(\hat{x}))^2 = (z - h(\hat{x}))^T (z - h(\hat{x})) \qquad (3.48)$$
The least squares fit, or least squares estimate (LS) is the parameter
vector which minimizes this norm:

$$\hat{x}_{LS}(z) = \arg\min_x \{ (z - h(x))^T (z - h(x)) \} \qquad (3.49)$$

If v is random with a normal distribution, zero mean and covariance
matrix $C_v = \sigma_v^2 I$, the LS estimate is identical to the ML estimate:
$\hat{x}_{LS} = \hat{x}_{ML}$. To see this, it suffices to note that in the Gaussian case the
likelihood takes the form

$$p(z|x) = \frac{1}{\sqrt{(2\pi)^N \sigma_v^{2N}}} \exp\left( -\frac{(z - h(x))^T (z - h(x))}{2\sigma_v^2} \right) \qquad (3.50)$$

The minimization of (3.48) is equivalent to the maximization of (3.50).
If the measurement function is linear, that is z = Hx + v, and H is an
N x M matrix having a rank M with M < N, then according to (3.25):

$$\hat{x}_{LS}(z) = \hat{x}_{ML}(z) = (H^T H)^{-1} H^T z \qquad (3.51)$$

Example 3.7 Repeated measurements

Suppose that a scalar parameter x is N times repeatedly measured
using a calibrated measurement device: $z_n = x + v_n$. These repeated
measurements can be represented by a vector $z = [z_1 \ldots z_N]^T$. The
corresponding measurement matrix is $H = [1 \ldots 1]^T$. Since
$(H^T H)^{-1} = 1/N$, the resulting least squares fit is:

$$\hat{x}_{LS} = \frac{1}{N} H^T z = \frac{1}{N} \sum_{n=1}^{N} z_n$$

In other words, the best fit is found by averaging the measurements.
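The result of Example 3.7 is easily checked. In the sketch below the true value, the noise level and the number of repetitions are arbitrary, assumed numbers.

```matlab
% Sketch: least squares fit of repeated measurements (hypothetical numbers)
N = 8;                          % number of repetitions
x_true = 3.7;                   % made-up true value of the scalar parameter
z = x_true + 0.2*randn(N,1);    % z_n = x + v_n
H = ones(N,1);                  % measurement matrix [1 ... 1]'

x_ls  = (H'*H) \ (H'*z);        % equation (3.51), with H'*H = N
x_avg = mean(z);                % plain average of the measurements
```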


Nonlinear sensors 


If h(.) is nonlinear, an analytic solution of (3.49) is often difficult. One is 
compelled to use a numerical solution. For that, several algorithms exist, 
such as ‘Gauss-Newton’, ‘Newton-Raphson’, ‘steepest descent’ and 
many others. Many of these algorithms are implemented within 
MATLAB’s optimization toolbox. The ‘Gauss-Newton’ method will be 
explained shortly. 
Assuming that some initial estimate $x_{ref}$ is available we expand (3.46)
in a Taylor series and neglect all terms of order two and higher:

$$z = h(x) + v \approx h(x_{ref}) + H_{ref}(x - x_{ref}) + v \quad \text{with:} \quad H_{ref} = \left. \frac{\partial h(x)}{\partial x} \right|_{x = x_{ref}} \qquad (3.52)$$

where $H_{ref}$ is the Jacobian matrix of h(.) evaluated at $x_{ref}$, see Appendix
B.4. With such a linearization, (3.51) applies. Therefore, the following
approximate value of the LS estimate is obtained:

$$\hat{x}_{LS} \approx x_{ref} + (H_{ref}^T H_{ref})^{-1} H_{ref}^T (z - h(x_{ref})) \qquad (3.53)$$

A refinement of the estimate could be achieved by repeating the
procedure with the approximate value as reference. This suggests an iterative
approach. Starting with some initial guess $\hat{x}(0)$, the procedure becomes
as follows:

$$\hat{x}(i+1) = \hat{x}(i) + (H^T(i) H(i))^{-1} H^T(i) \big(z - h(\hat{x}(i))\big) \quad \text{with:} \quad H(i) = \left. \frac{\partial h(x)}{\partial x} \right|_{x = \hat{x}(i)} \qquad (3.54)$$

In each iteration, the variable i is incremented. The iterative process
stops if the difference between $\hat{x}(i+1)$ and $\hat{x}(i)$ is smaller than some
predefined threshold. The success of the method depends on whether the
first initial guess is already close enough to the global minimum. If not,
the process will either diverge, or get stuck in a local minimum.
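The iteration (3.54) takes only a few lines of code. The sketch below applies it to a hypothetical measurement function, an exponential decay with unknown amplitude and rate, for which the Jacobian matrix is available analytically. The function, the true parameters, the initial guess and the threshold are assumptions chosen for the illustration.

```matlab
% Sketch of the Gauss-Newton iteration (3.54) for a hypothetical model
t = (0:0.25:5)';                                  % sample instants
h   = @(x) x(1)*exp(-x(2)*t);                     % measurement function h(x)
jac = @(x) [exp(-x(2)*t), -x(1)*t.*exp(-x(2)*t)]; % Jacobian matrix dh/dx
x_true = [2; 0.7];                                % made-up true parameters
z = h(x_true) + 0.01*randn(size(t));              % simulated measurements

x = [1.5; 0.5];                                   % initial guess x(0)
tol = 1e-8; max_iter = 50;                        % stopping threshold and iteration limit
for i = 1:max_iter
    Hi = jac(x);                                  % H(i), evaluated at the current estimate
    dx = (Hi'*Hi) \ (Hi'*(z - h(x)));             % LS solution of the linearized problem
    x = x + dx;                                   % equation (3.54)
    if norm(dx) < tol, break; end                 % stop when the update is small
end
```

With an initial guess further away from the true parameters the same loop may diverge or end in a local minimum, as remarked above. The iteration limit max_iter guards against endless looping in that case.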


Example 3.8 Estimation of the diameter of a blood vessel 

In vascular X-ray imaging, one of the interesting parameters is the 
diameter of blood vessels. This parameter provides information about 
a possible constriction. As such, it is an important aspect in cardiolo- 
gic diagnosis. 

Figure 3.8(a) is a (simulated) X-ray image of a blood vessel of the 
coronary circulation. The image quality depends on many factors. 
Most important are the low-pass filtered noise (called quantum mot- 
tle) and the image blurring due to the image intensifier. 

Figure 3.8(b) shows the one-dimensional, vertical cross-section of
the image at a location as indicated by the two black arrows in
Figure 3.8(a). Suppose that our task is to estimate the diameter of the
imaged blood vessel from the given cross-section. Hence, we define
a measurement vector z whose elements consist of the pixel grey values
along the cross-section.

Figure 3.8 LS estimation of the diameter D and the position $y_0$ of a blood vessel. (a)
X-ray image of the blood vessel. (b) Cross-section of the image together with fitted
profile. (c) The sum of least squared errors as a function of the diameter and the
position

The parameter of interest is the diameter D. However, other parameters
might be unknown as well, e.g. the position and orientation of
the blood vessel, the attenuation coefficient or the intensity of the
X-ray source. This example will be confined to the case where the only
unknown parameters are the diameter D and the position $y_0$ of the
imaged blood vessel in the cross-section. Thus, the parameter vector is
two-dimensional: $x^T = [D \;\; y_0]$. Since all other parameters are
assumed to be known, a radiometric model can be worked out to a
measurement function h(x) which quantitatively predicts the
cross-section, and thus also the measurement vector z for any value of the
parameter vector x.

With this measurement function it is straightforward to calculate the
LS norm $\|\varepsilon\|_2^2$ for a couple of values of x. Figure 3.8(c) is a graphical
representation of that. It appears that the minimum of $\|\varepsilon\|_2^2$ is obtained
if $\hat{D}_{LS} = 0.42$ mm. The true diameter is D = 0.40 mm. The thus
obtained fitted cross-section is also shown in Figure 3.8(b).

Note that the LS norm in Figure 3.8(c) is a smooth function of x. 
Hence, the convergence region of a numerical optimizer will be large. 


3.3.2 Fitting using a robust error norm 


Suppose that the measurement vector in an LS estimator has a small
number of elements with large measurement errors, the so-called
outliers. The influence of an outlier is much larger than that of the others
because the LS estimator weights the errors quadratically. Consequently,
the robustness of LS estimation is poor.

Much can be improved if the influence is bounded in one way or 
another. This is exactly the general idea of applying a robust error norm. 
Instead of using the sum of squared differences, the objective function of 
(3.48) becomes: 


$$\|\varepsilon\|_{robust} = \sum_{n=0}^{N-1} \rho(\varepsilon_n) = \sum_{n=0}^{N-1} \rho(z_n - h_n(\hat{x})) \qquad (3.55)$$


$\rho(.)$ measures the size of each individual residual $z_n - h_n(\hat{x})$. This
measure should be selected such that above a given level of $\varepsilon_n$ its influence is
ruled out. In addition, one would like to have $\rho(.)$ being smooth so that
numerical optimization of $\|\varepsilon\|_{robust}$ is not too difficult. A suitable choice
(among others) is the so-called Geman-McClure error norm:

$$\rho(\varepsilon) = \frac{\varepsilon^2}{\varepsilon^2 + \sigma^2} \qquad (3.56)$$

A graphical representation of this function and its derivative is shown in
Figure 3.9. The parameter $\sigma$ is a soft threshold value. For values of $\varepsilon$
smaller than about $\sigma$, the function follows the LS norm. For values larger
than $\sigma$, the function gets saturated. Consequently, for small values of $\varepsilon$ the
derivative $\rho_\varepsilon(\varepsilon) = \partial\rho(\varepsilon)/\partial\varepsilon$ of $\rho(.)$ is nearly proportional to $\varepsilon$, as it is for
the LS norm. But for large values of $\varepsilon$, i.e. for outliers, it becomes nearly
zero. Therefore, in a Gauss-Newton style of optimization, the Jacobian matrix
is virtually zero for outliers. Only residuals that are about as large as $\sigma$ or
smaller than that play a role.

Figure 3.9 A robust error norm and its derivative
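The bounded influence of outliers can be demonstrated with the repeated-measurement problem of Example 3.7. The sketch below defines the Geman-McClure norm and its derivative, and then minimizes the robust norm with a simple iteratively reweighted averaging scheme. That scheme is a choice made for this illustration, not the Gauss-Newton style of optimization mentioned above; the data and the threshold are made up as well.

```matlab
% Sketch: Geman-McClure norm (3.56) applied to repeated measurements with one outlier
sigma = 0.5;                                    % soft threshold (hypothetical value)
rho  = @(e) e.^2 ./ (e.^2 + sigma^2);           % equation (3.56)
drho = @(e) 2*sigma^2*e ./ (e.^2 + sigma^2).^2; % derivative of rho with respect to e

z = [1.0; 1.1; 0.9; 1.05; 0.95; 10];            % five consistent readings and one outlier
x_ls = mean(z);                                 % LS fit of a constant, pulled by the outlier

x_rob = median(z);                              % starting point of the iteration
for it = 1:50
    e = z - x_rob;                              % residuals of the current estimate
    w = 2*sigma^2 ./ (e.^2 + sigma^2).^2;       % weights drho(e)./e, without dividing by e
    x_new = sum(w.*z) / sum(w);                 % weighted average solves sum(w.*(z-x)) = 0
    converged = abs(x_new - x_rob) < 1e-10;
    x_rob = x_new;
    if converged, break; end
end
```

The weight of the outlier is several orders of magnitude smaller than the weights of the other readings, so the robust fit stays close to 1, whereas the LS fit is dragged to 2.5.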


Example 3.9 Robust estimation of the diameter of a blood vessel
If in example 3.8 the diameter must be estimated near the bifurcation
(as indicated in Figure 3.8(a) by the white arrows) a large modelling
error occurs because of the branching vessel. See the cross-section in
Figure 3.10(a). These modelling errors are large compared to the
noise and they should be considered as outliers. However, Figure
3.10(b) shows that the error landscape $\|\varepsilon\|_2^2$ has its minimum at
$\hat{D}_{LS} = 0.50$ mm. The true value is D = 0.40 mm. Furthermore, the
minimum is less pronounced than the one in Figure 3.8(c), and
therefore also less stable.

Note also that in Figure 3.10(a) the position found by the LS estimator
is in the middle between the two true positions of the two vessels.

Figure 3.11 shows the improvements that are obtained by applying
a robust error norm. The threshold $\sigma$ is selected just above the noise
level. For this setting, the error landscape clearly shows two
pronounced minima corresponding to the two blood vessels. The global
minimum is reached at $\hat{D}_{robust} = 0.44$ mm. The estimated position
now clearly corresponds to one of the two blood vessels as shown in
Figure 3.11(a).
Figure 3.10 LS estimation of the diameter D and the position $y_0$. (a) Cross-section
of the image together with a profile fitted with the LS norm. (b) The LS norm as a
function of the diameter and the position

Figure 3.11 Robust estimation of the diameter D and the position $y_0$. (a) Cross-section
of the image together with a profile fitted with a robust error norm. (b) The
robust error norm as a function of the diameter and the position


3.3.3 Regression 


Regression is the act of deriving an empirical function from a set of 
experimental data. Regression analysis considers the situation involving 
pairs of measurements (t, z). The variable t is regarded as a measurement 
without any appreciable error. t is called the independent variable. 
We assume that some empirical function f(.) is chosen that (hopefully)
can predict z from the independent variable t. Furthermore, a parameter
vector x can be used to control the behaviour of f(.). Hence, the model is:

$$z = f(t, x) + \varepsilon \qquad (3.57)$$

f(.,.) is the regression curve, and $\varepsilon$ represents the residual, i.e. the part of
z that cannot be predicted by f(.,.). Such a residual can originate from
sensor noise (or other sources of randomness) which makes the
prediction uncertain, but it can also be caused by an inadequate choice of the
regression curve.

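The goal of regression is to determine an estimate $\hat{x}$ of the parameter
vector x based on N observations $(t_n, z_n)$, n = 0,...,N − 1 such that the
residuals $\varepsilon_n$ are as small as possible. We can stack the observations $z_n$ in
a vector z. Using (3.57), the problem of finding $\hat{x}$ can be transformed to
the standard form of (3.47):

$$z = h(x) + \varepsilon \quad \text{with:} \quad z = \begin{bmatrix} z_0 \\ \vdots \\ z_{N-1} \end{bmatrix}, \quad h(x) = \begin{bmatrix} f(t_0, x) \\ \vdots \\ f(t_{N-1}, x) \end{bmatrix}, \quad \varepsilon = \begin{bmatrix} \varepsilon_0 \\ \vdots \\ \varepsilon_{N-1} \end{bmatrix} \qquad (3.58)$$

where $\varepsilon$ is the vector that embodies the residuals $\varepsilon_n$.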

Since the model is in the standard form, x can be estimated with a least 
squares approach as in Section 3.3.1. Alternatively, we use a robust error 
norm as defined in Section 3.3.2. The minimization of such a norm is 
called robust regression analysis. 

In the simplest case, the regression curve f(t, x) is linear in x. With that,
the model becomes of the form $z = Hx + \varepsilon$, and thus, the solution of
(3.51) applies. As an example, we consider polynomial regression for
which the regression curve is a polynomial of order M − 1:

$$f(t, x) = x_0 + x_1 t + \cdots + x_{M-1} t^{M-1} \qquad (3.59)$$

If, for instance, M = 3, then the regression curve is a parabola described
by three parameters. These parameters can be found by least squares
estimation using the following model:

$$z = \begin{bmatrix} 1 & t_0 & t_0^2 \\ 1 & t_1 & t_1^2 \\ \vdots & \vdots & \vdots \\ 1 & t_{N-1} & t_{N-1}^2 \end{bmatrix} x + \varepsilon \qquad (3.60)$$
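The model (3.60) can be set up explicitly and solved with (3.51). The sketch below does so for a synthetic parabola; the sample instants, the true coefficients and the noise level are assumed values. For comparison, the same fit is computed with MATLAB's polyfit() routine, which returns the coefficients in the reverse order (highest power first).

```matlab
% Sketch: polynomial regression with an explicit measurement matrix (synthetic data)
t = (0:0.5:10)';                           % values of the independent variable
x_true = [1; -0.5; 0.1];                   % made-up coefficients x0, x1, x2
z = x_true(1) + x_true(2)*t + x_true(3)*t.^2 + 0.05*randn(size(t));

M = 3;                                     % number of parameters, polynomial order M-1
H = zeros(numel(t), M);
for m = 1:M
    H(:,m) = t.^(m-1);                     % columns 1, t, t.^2 as in (3.60)
end

x_ls = (H'*H) \ (H'*z);                    % least squares estimate, equation (3.51)
p = polyfit(t, z, M-1);                    % same fit; p holds the highest power first
```

For polynomials of high order the matrix H'*H becomes badly conditioned. In that case, solving the overdetermined system directly with H\z is numerically safer than forming H'*H.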
Example 3.10 The calibration curve of a level sensor
In this example, the goal is to determine a calibration curve of a level 
sensor to be used in a water tank. For that purpose, a second meas- 
urement system is available with a much higher precision than the 
‘sensor under test’. The measurement results of the second system 
serve as a reference. Figure 3.12 shows the observed errors of the 
sensor versus the reference values. Here, 46 pairs of measurements 
are shown. A zero order (fit with a constant), a first order (linear 
fit) and a tenth order polynomial are fitted to the data. As can be 
seen, the constant fit appears to be inadequate for describing the 
data (the model is too simple). The first order polynomial describes 
the data reasonably well, and is also suitable for extrapolation. The 
tenth order polynomial follows the measurement points better, but 
also the noise. It cannot be used for extrapolation because it is an 
example of overfitting the data. This occurs whenever the model 
has too many degrees of freedom compared to the number of data 
samples. 

Listing 3.3 illustrates how to fit and evaluate polynomials using 
MATLAB’s polyfit() and polyval() routines. 
Figure 3.12 Determination of a calibration curve by means of polynomial
regression

Listing 3.3
MATLAB code for polynomial regression.

load levelsensor;                        % Load dataset (t,z)
figure; clf; plot(t,z,'k.'); hold on;    % Plot it
y = 0:0.2:30;                            % Plot range of the independent variable
M = [1 2 11];                            % Numbers of parameters: orders 0, 1 and 10
plotstring = {'k--','k-','k:'};
for m = 1:3
    p = polyfit(t,z,M(m)-1);             % Fit polynomial of order M(m)-1
    z_hat = polyval(p,y);                % Calculate plot points
    plot(y,z_hat,plotstring{m});         % and plot them
end
axis([0 30 -0.3 0.2]);


3.4 OVERVIEW OF THE FAMILY OF ESTIMATORS 


The chapter concludes with the overview shown in Figure 3.13. Two 
main approaches have been discussed, the Bayes estimators and the 
fitting techniques. Both approaches are based on the minimization of 
an objective function. The difference is that with Bayes, the objective 
function is defined in the parameter domain, whereas with fitting tech- 
niques, the objective function is defined in the measurement domain. 
Another difference is that the Bayes approach has a probabilistic con- 
text, whereas the approach of fitting lacks such a context. 

Within the family of Bayes estimators we have discussed two estima- 
tors derived from two cost functions. The quadratic cost function leads 
to MMSE (minimum variance) estimation. The cost function is such that 
small errors are regarded as unimportant, while larger errors are con- 
sidered more and more serious. The solution is found as the conditional 
mean, i.e. the expectation of the posterior probability density. The 
estimator is unbiased. 

If the MMSE estimator is constrained to be linear, the solution can be 
expressed entirely in terms of first and second order moments. If, in 
addition, the sensory system is linear with additive, uncorrelated noise, a 
simple form of the estimator appears that is used in Kalman filtering. 
This form is sometimes referred to as the Kalman form. 

The other Bayes estimator is based on the uniform cost function. This
cost function is such that the damage of small and large errors is
weighted equally. It leads to MAP estimation. The solution appears to
be the mode of the posterior probability density. The estimator is not
guaranteed to be unbiased.

Figure 3.13 A family tree of estimators

It is remarkable that although the quadratic cost function and the uniform
cost function differ a lot, the solutions are identical provided that the
posterior density is uni-modal and symmetric. An example of this occurs
when the prior probability density and conditional probability density
are both Gaussian. In that case the posterior probability density is
Gaussian too.

If no prior knowledge about the parameter is available, one can use
the ML estimator. Another possibility is to resort to fitting techniques, of
which the LS estimator is most popular. The ML estimator is essentially
a MAP estimator with uniform prior probability. Under the assumption
of normally distributed, white sensor noise the ML solution and the LS
solution are identical. If, in addition, the sensors are linear, the ML and LS
estimators become the pseudo inverse.

A robust estimator is one which can cope with outliers in the
measurements. Such an estimator can be achieved by application of a robust
error norm.
3.6 EXERCISES


1. Prove that the linear MMSE estimator, whose form is $\hat{x}_{lMMSE}(z) = Kz$, is found as:

$$K = M_{xz} M_z^{-1} \quad \text{with} \quad M_{xz} \stackrel{def}{=} E[x z^T] \quad \text{and} \quad M_z \stackrel{def}{=} E[z z^T] \qquad (*)$$


2. In the Gaussian case, in Section 3.1.3, we silently assumed that the covariance matrices
$C_x$ and $C_v$ are invertible. What can be said about the elements of x if $C_x$ is singular?
And what about the elements of v if $C_v$ is singular? What must be done to avoid such
a situation? (*)

3. Prove that, in Section 3.1.3, the posterior density is Gaussian, and prove equation 
(3.20). (*) Hint: use equation (3.2), and expand the argument of the exponential. 

4. Prove equation (3.32). (0) 


5. Use the matrix inversion lemma (B.10) to prove that the form given in (3.20):

$$\hat{x}_{MMSE}(z) = C_e \left( C_x^{-1}\mu_x + H^T C_v^{-1} z \right) \quad\text{with}\quad C_e = \left( H^T C_v^{-1} H + C_x^{-1} \right)^{-1}$$

is equivalent to the Kalman form given in (3.33). (*)

6. Explain why (3.42) cannot be replaced by $C_e = \int C_e(x) p(x)\,dx$. (*)

7. Prove that the unbiased linear MMSE estimator is indeed unbiased. (0) 

8. Given that the random variable z is binomially distributed (Appendix C.1.3) with
parameters (x, M). x is the probability of success of a single trial. M is the number of
independent trials. z is the number of successes in M trials. The parameter x must be
estimated from an observed value of z.

• Develop the ML estimator for x. (0)
• What are the bias and the variance of this estimator? (0)

9. If, without having observed z, the parameter x in exercise 8 is uniformly distributed
between 0 and 1, what will be the posterior density p(x|z) of x? Develop the MMSE
estimator and the MAP estimator for this case. What will be the bias and the variance
of these estimators? (*)

10. A Geiger counter is an instrument that measures radioactivity. Essentially, it counts
the number of events (arrival of nuclear particles) within a given period of time.
These numbers are Poisson distributed with expectation $\lambda$, i.e. the mean number of
events within the period. z is the counted number of events within a period. We
assume that $\lambda$ is uniformly distributed between 0 and L.

• Develop the ML estimator for $\lambda$. (0)
• Develop the MAP estimator for $\lambda$. What is the bias and the variance of the ML
estimator? (*)
• Show that the ML estimator is absolutely unbiased, and that the MAP estimator is
biased. (*)
• Give an expression for the (overall) variance of the ML estimator. (0)


State Estimation


The theme of the previous two chapters will now be extended to the case 
in which the variables of interest change over time. These variables can 
be either real-valued vectors (as in Chapter 3), or discrete class variables 
that only cover a finite number of symbols (as in Chapter 2). In both 
cases, the variables of interest are called state variables. 

The design of a state estimator is based on a state space model that 
describes the underlying physical process of the application. For 
instance, in a tracking application, the variables of interest are the 
position and velocity of a moving object. The state space model gives 
the connection between the velocity and the position (which, in this case, 
is a kinematical relation). Variables, like position and velocity, are real 
numbers. Such variables are called continuous states. 

The design of a state estimator is also based on a measurement model 
that describes how the data of a sensory system depend on the state 
variables. For instance, in a radar tracking system, the measurements are 
the azimuth and range of the object. Here, the measurements are directly 
related to the two-dimensional position of the object if represented in 
polar coordinates. 

The estimation of a dynamic class variable, i.e. a discrete state variable,
is sometimes called mode estimation or labelling. An example is in
speech recognition where, for the recognition of a word, a sequence
of phonetic classes must be estimated from a sequence of acoustic
features. Here too, the analysis is based on a state space model and a
measurement model (in fact, each possible word has its own state space
model).

The outline of the chapter is as follows. Section 4.1 gives a framework 
for estimation in dynamic systems. It introduces the various concepts, 
notations and mathematical models. Next, it presents a general scheme 
to obtain the optimal solution. In practice, however, such a general 
scheme is of less value because of the computational complexity involved 
when trying to implement the solution directly. Therefore, the general 
approach needs to be worked out for different cases. Section 4.2 is 
devoted to the case of continuous state variables. Practical solutions 
are feasible if the models are linear-Gaussian (Section 4.2.1). If the 
model is not linear, one can resort to suboptimal methods (Section 4.2.2). 
Section 4.3 deals with the discrete state case. The chapter concludes
with Section 4.4, which contains an introduction to particle filtering.
This technique can handle nonlinear and non-Gaussian models
covering the continuous and the discrete case, and even mixed cases
(i.e. combinations of continuous and discrete states).

The chapter confines itself to the theoretical aspects of state
estimation. Practical issues, like implementation, deployment and
consistency checks, are dealt with in Chapter 8. The use of MATLAB is
also deferred to that chapter.


4.1 A GENERAL FRAMEWORK FOR ONLINE 
ESTIMATION 


Usually, the estimation problem is divided into three paradigms: 


• online estimation (optimal filtering)
• prediction
• retrodiction (smoothing, offline estimation).


Online estimation is the estimation of the present state using all the 
measurements that are available, i.e. all measurements up to the present 
time. Prediction is the estimation of future states. Retrodiction is the 
estimation of past states. 

This section sets up a framework for the online estimation of the states
of time-discrete processes. Of course, most physical processes evolve in
continuous time. Nevertheless, we will assume that these systems
can be described adequately by a model where the continuous time is
reduced to a sequence of specific times. Methods for the conversion from
time-continuous to time-discrete models are described in many textbooks,
for instance, on control engineering.


4.1.1 Models 


We assume that the continuous time t is equidistantly sampled with
period $\Delta$. The discrete time index is denoted by an integer variable i.
Hence, the moments of sampling are $t_i = i\Delta$. Furthermore, we assume
that the estimation problem starts at t = 0. Thus, i is a non-negative
integer denoting the discrete time.


The state space model 


The state at time i is denoted by $x(i) \in X$ where X is the state space. For
discrete states, $X = \Omega = \{\omega_1, \ldots, \omega_K\}$ where $\omega_k$ is the k-th symbol (label,
or class) out of K possible classes. For real-valued vectors with
dimension M, we have $X = \mathbb{R}^M$.

Suppose for a moment that we have observed the state of a process
during its whole history, i.e. from the beginning of time up to the
present. In other words, the sequence of states x(0), x(1), ..., x(i) is
observed and as such fully known. i denotes the present time. In
addition, suppose that, using this sequence, we want to estimate (predict)
the next state x(i+1). Assuming that the states can be modelled as
random variables, we need to evaluate the conditional probability
density¹ p(x(i+1)|x(0), x(1), ..., x(i)). Once this probability density is
known, the application of the theory in Chapters 2 and 3 will provide
the optimal estimate of x(i+1). For instance, if X is a real-valued vector
space, the Bayes estimator from Chapter 3 provides the best prediction
of the next state (the density p(x(i+1)|x(0), x(1), ..., x(i)) must be used
instead of the posterior density).

Unfortunately, the evaluation of p(x(i+1)|x(0), x(1), ..., x(i)) is a
nasty task because it is not clear how to establish such a density in
real-world problems. Things become much easier if we succeed in defining
the state such that the so-called Markov condition applies:


$$p(x(i+1)|x(0), x(1), \ldots, x(i)) = p(x(i+1)|x(i)) \qquad (4.1)$$





1 For the finite-state case, the probability densities transform into probabilities, and appropriate
summations replace the integrals.

The probability of x(i+1) depends solely on x(i) and not on the past
states. In order to predict x(i+1), the knowledge of the full history is
not needed. It suffices to know the present state. If the Markov condition
applies, the state of a physical process is a summary of the history of the
process.


Example 4.1 The density of a substance mixed with a liquid 
Mixing and diluting are tasks frequently encountered in the food
industry, paper industry, cement industry, and so on. One of the
parameters of interest during the production process is the density D(t) of
some substance. It is defined as the fraction of the volume of the mix
that is made up by the substance.

Accurate models for these production processes soon involve a
large number of state variables. Figure 4.1 is a simplified view of the
process. It is made up of two real-valued state variables and one
discrete state. The volume V(t) of the liquid in the barrel is regulated
by an on/off feedback control of the input flow $f_1(t)$ of the liquid:
$f_1(t) = f_0 x(t)$. The on/off switch is represented by the discrete state
variable $x(t) \in \{0, 1\}$. A hysteresis mechanism using a level detector
(LT) prevents jitter of the switch. x(t) switches to the ‘on’ state (=1) if
$V(t) < V_{low}$, and switches back to the ‘off’ state (=0) if $V(t) > V_{high}$.

The rate of change of the volume is $\dot{V}(t) = f_1(t) + f_2(t) - f_3(t)$
with $f_2(t)$ the volume flow of the substance, and $f_3(t)$ the output
volume flow of the mix. We assume that the output flow is governed
by Torricelli’s law: $f_3(t) = c\sqrt{V(t)/V_{ref}}$. The density is defined
as $D(t) = V_S(t)/V(t)$ where $V_S(t)$ is the volume of the substance. The
rate of change of $V_S(t)$ is: $\dot{V}_S(t) = f_2(t) - D(t) f_3(t)$. After some
manipulations the following system of differential equations
appears:

$$\begin{aligned}
\dot{V}(t) &= f_1(t) + f_2(t) - c\sqrt{V(t)/V_{ref}} \\
\dot{D}(t) &= \frac{f_2(t)\,(1 - D(t)) - f_1(t)\,D(t)}{V(t)}
\end{aligned} \qquad (4.2)$$

Figure 4.1 A density control system for the process industry

A discrete time approximation² (notation: $V(i) \equiv V(i\Delta)$, $D(i) \equiv D(i\Delta)$,
and so on) is:

$$\begin{aligned}
V(i+1) &\simeq V(i) + \Delta \left( f_0 x(i) + f_2(i) - c\sqrt{V(i)/V_{ref}} \right) \\
D(i+1) &\simeq D(i) - \Delta\, \frac{f_0 x(i) D(i) - f_2(i)\,(1 - D(i))}{V(i)} \\
x(i+1) &= \left( x(i) \wedge \neg (V(i) > V_{high}) \right) \vee \left( V(i) < V_{low} \right)
\end{aligned} \qquad (4.3)$$

This equation is of the type $\mathbf{x}(i+1) = f(\mathbf{x}(i), u(i), w(i))$ with
$\mathbf{x}(i) = [\,V(i)\ \ D(i)\ \ x(i)\,]^T$. The elements of the vector u(i) are the
known input variables, i.e. the non-random part of $f_2(i)$. The vector
w(i) contains the random input, i.e. the random part of $f_2(i)$. The
probability density of $\mathbf{x}(i+1)$ depends on the present state $\mathbf{x}(i)$, but
not on the past states.

Figure 4.1 shows a realization of the process. Here, the substance is 
added to the volume in chunks with an average volume of 10 litre and 
at random points in time. 


If the transition probability density p(x(i + 1)|x(i)) is known together 
with the initial probability density p(x(0)), then the probability density 
at an arbitrary time can be determined recursively: 


$$p(x(i+1)) = \int_{x(i) \in X} p(x(i+1)|x(i))\, p(x(i))\, dx(i) \quad\text{for } i = 0, 1, \ldots \qquad (4.4)$$

2 The approximation that is used here is $\dot{V}(t)\Delta \approx V((i+1)\Delta) - V(i\Delta)$. The approximation is
only close if $\Delta$ is sufficiently small. Other approximations may be more accurate, but this
subject is outside the scope of the book.

The joint probability density of the sequence x(0), x(1), ..., x(i) follows
readily:

$$p(x(0), \ldots, x(i)) = p(x(i)|x(i-1))\, p(x(0), \ldots, x(i-1)) = p(x(0)) \prod_{j=1}^{i} p(x(j)|x(j-1)) \qquad (4.5)$$
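
For a finite state space the integral in (4.4) becomes a sum, and both the recursion for p(x(i)) and the joint probability (4.5) of a state sequence can be evaluated directly. The following is a minimal Python sketch under that assumption. The two-state chain, its transition probabilities `P` and the initial probabilities `p0` are hypothetical numbers chosen for illustration only.

```python
# Hypothetical two-state Markov chain (illustrative numbers, not from the text).
# P[j][k] = p(x(i+1) = k | x(i) = j): the transition probabilities.
P = [[0.9, 0.1],
     [0.3, 0.7]]
p0 = [0.5, 0.5]  # initial probabilities p(x(0))


def propagate(p, P):
    """One step of (4.4), with the integral replaced by a sum over states."""
    n = len(p)
    return [sum(P[j][k] * p[j] for j in range(n)) for k in range(n)]


def marginals(p0, P, steps):
    """Return the list [p(x(0)), p(x(1)), ..., p(x(steps))]."""
    out = [list(p0)]
    for _ in range(steps):
        out.append(propagate(out[-1], P))
    return out


def joint_probability(seq, p0, P):
    """Probability of a whole state sequence x(0), ..., x(i), following (4.5)."""
    prob = p0[seq[0]]
    for prev, nxt in zip(seq, seq[1:]):
        prob *= P[prev][nxt]
    return prob


history = marginals(p0, P, 3)
sequence_prob = joint_probability([0, 0, 1, 1], p0, P)
```

Summing `joint_probability` over all sequences that end in a given state reproduces the corresponding entry of `marginals`, which is a convenient consistency check.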


The measurement model 


In addition to the state space model, we also need a measurement model
that describes the data from the sensor in relation to the state. Suppose
that at moment i the measurement data is $z(i) \in Z$ where Z is the
measurement space. For the real-valued state variables, the measurement
space is often a real-valued vector space, i.e. $Z = \mathbb{R}^N$. For the discrete
case, one often assumes that the measurement space is also finite, i.e.
$Z = \{\vartheta_1, \ldots, \vartheta_N\}$.

The probabilistic model of the sensory system is fully defined by the 
conditional probability density p(z(i)|x(0),...,x(i),z(0),...,z(i—1)). 
We assume that the sequence of measurements starts at time i= 0. 
In order to shorten the notation, the sequence of all measurements up to 
the present will be denoted by: 


$$Z(i) = \{z(0), \ldots, z(i)\} \qquad (4.6)$$


We restrict ourselves to memoryless sensory systems, i.e. systems where 
z(i) depends on the value of x(i), but not on previous states nor on 
previous measurements. In other words: 


$$p(z(i)|x(0), \ldots, x(i), Z(i-1)) = p(z(i)|x(i)) \qquad (4.7)$$


4.1.2 Optimal online estimation 


Figure 4.2 presents an overview of the scheme for the online estimation
of the state. The connotation of the phrase online is that for each time
index i an estimate $\hat{x}(i)$ of x(i) is produced based on Z(i), i.e. based on
all measurements that are available at that time. The crux of optimal
online estimation is to maintain the posterior density p(x(i)|Z(i)) for
running values of i. This density captures all the available information of
the current state x(i) after having observed the current measurement and
all previous ones. With the availability of the posterior density, the
methods discussed in Chapters 2 and 3 become applicable. The only
work to be done, then, is to adopt an optimality criterion and to work
this out using the posterior density to get the optimal estimate of the
current state.

Figure 4.2 An overview of online estimation

The maintenance of the posterior density is done efficiently by means 
of a recursion. From the posterior density p(x(i)|Z(i)), valid for the 
current period i, the density p(x(i + 1)|Z(i + 1)), valid for the next 
period i+ 1, is derived. The first step of the recursion cycle is a predic- 
tion step. The knowledge about x(i) is extrapolated to knowledge about 
x(i+ 1). Using Bayes’ theorem for conditional probabilities in combina- 
tion with the Markov condition (4.1), we have: 


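$$\begin{aligned}
p(x(i+1)|Z(i)) &= \int_{x(i) \in X} p(x(i+1), x(i)|Z(i))\, dx(i) \\
&= \int_{x(i) \in X} p(x(i+1)|x(i), Z(i))\, p(x(i)|Z(i))\, dx(i) \\
&= \int_{x(i) \in X} p(x(i+1)|x(i))\, p(x(i)|Z(i))\, dx(i)
\end{aligned} \qquad (4.8)$$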

At this point, we increment the counter i, so that p(x(i+1)|Z(i)) now
becomes p(x(i)|Z(i-1)). The increment can be done anywhere in the
loop, but the choice to do it at this point leads to a shorter notation of
the second step.

The second step is an update step. The knowledge gained from observ- 
ing the measurement z(i) is used to refine the density. Using — once 
again — the theorem of Bayes, now in combination with the conditional 
density for memoryless sensory systems (4.7): 


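$$\begin{aligned}
p(x(i)|Z(i)) &= p(x(i)|Z(i-1), z(i)) \\
&= \frac{1}{c}\, p(z(i)|x(i), Z(i-1))\, p(x(i)|Z(i-1)) \\
&= \frac{1}{c}\, p(z(i)|x(i))\, p(x(i)|Z(i-1))
\end{aligned} \qquad (4.9)$$

where c is a normalization constant:

$$c = \int_{x(i) \in X} p(z(i)|x(i))\, p(x(i)|Z(i-1))\, dx(i) \qquad (4.10)$$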


The recursion starts with the processing of the first measurement z(0). 
The posterior density p(x(0)|Z(0)) is obtained using p(x(0)) as the prior. 
The outline for optimal estimation, expressed in (4.8), (4.9) and (4.10), 
is useful in the discrete case (where integrals turn into summations). For 
continuous states, a direct implementation is difficult for two reasons: 


• It requires efficient representations for the N- and M-dimensional
density functions.

• It requires efficient algorithms for the integrations over an
M-dimensional space.


Both requirements are hard to fulfil, especially if M is large. Nonetheless, 
many researchers have tried to implement the general scheme. One of the 
most successful endeavours has resulted in what is called particle filtering 
(see Section 4.4). But first, the discussion will be focused on special cases. 
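
In the discrete case the scheme of (4.8), (4.9) and (4.10) can be implemented directly, because all densities are finite lists of probabilities. The Python sketch below illustrates one way to do this. The transition table `P`, the sensor table `B`, the prior `p0` and the measurement sequence are hypothetical values introduced here for illustration, and the MAP criterion is just one possible choice of optimality criterion.

```python
def predict(posterior, P):
    """Prediction step (4.8): p(x(i+1)|Z(i)) from p(x(i)|Z(i))."""
    n = len(posterior)
    return [sum(P[j][k] * posterior[j] for j in range(n)) for k in range(n)]


def update(prior, likelihood):
    """Update step (4.9); the normalization constant c is (4.10) as a sum."""
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    c = sum(unnormalized)
    return [u / c for u in unnormalized]


def online_estimation(p0, P, B, measurements):
    """Return the posteriors p(x(i)|Z(i)) for i = 0, 1, ...

    P[j][k] = p(x(i+1) = k | x(i) = j)   (state space model)
    B[k][z] = p(z(i) = z | x(i) = k)     (memoryless sensor model, (4.7))
    """
    posteriors = []
    prior = list(p0)                      # p(x(0)) is the prior for z(0)
    for z in measurements:
        likelihood = [B[k][z] for k in range(len(prior))]
        posterior = update(prior, likelihood)
        posteriors.append(posterior)
        prior = predict(posterior, P)     # p(x(i)|Z(i-1)) after incrementing i
    return posteriors


# Hypothetical two-state, two-symbol example.
P = [[0.9, 0.1], [0.3, 0.7]]
B = [[0.8, 0.2], [0.25, 0.75]]
p0 = [0.5, 0.5]
posteriors = online_estimation(p0, P, B, [0, 0, 1])
# MAP estimate of the state at each time index.
map_states = [max(range(len(p)), key=lambda k: p[k]) for p in posteriors]
```

The cost per time step is proportional to the square of the number of states, which is why the direct scheme is practical for finite state spaces but not for continuous ones.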


4.2 CONTINUOUS STATE VARIABLES 


This section addresses the case when both the state and the
measurements are real-valued vectors with dimensions of M and N, respectively.
The starting point is the general scheme for online estimation discussed
in the previous section, and illustrated in Figure 4.2. As said before, in
general, a direct implementation of the scheme is difficult. Fortunately,
there are circumstances which allow a fast implementation. For instance,
in the special case where the models are linear and the disturbances have
a normal distribution, an implementation based on an ‘expectation and
covariance matrix’ representation of the probability densities is feasible
(Section 4.2.1). If the models are nonlinear, but the nonlinearity is
smooth, linearization techniques can be applied (Section 4.2.2). If the
models are highly nonlinear, but the dimensions N and M are not too
large, numerical methods are possible (Sections 4.2.3 and 4.4).


4.2.1 Optimal online estimation in linear-Gaussian systems 


Most of the literature on optimal estimation in dynamic systems deals with
the particular case in which both the state model and the measurement
model are linear, and the disturbances are Gaussian (the linear-Gaussian
systems). Perhaps the main reason for this popularity is the mathematical
tractability of this case.


Linear-Gaussian state space models 


The state model is said to be linear if the transition from one state to the 
next can be expressed by a so-called linear system equation (or: linear 
state equation, linear plant equation, linear dynamic equation): 


$$x(i+1) = F(i)x(i) + L(i)u(i) + w(i) \qquad (4.11)$$


F(i) is the system matrix. It is an M x M matrix where M is the
dimension of the state vector. M is called the order of the system. The vector
u(i) is the control vector (input vector) of dimension L. Usually, the
vector is generated by a controller according to some control law. As
such the input vector is a deterministic signal that is fully known, at least
up to the present. L(i) is the gain matrix of dimension M x L. Sometimes
the matrix is called the distribution matrix as it distributes the control
vector across the elements of the state vector.

w(i) is the process noise (system noise, plant noise). It is a sequence of
random vectors of dimension³ M. The process noise represents the
unknown influences on the system, for instance, formed by disturbances
from the environment. The process noise can also represent an unknown
input/control signal. Sometimes process noise is also used to take care of
modelling errors. The general assumption is that the process noise is a
white random sequence with normal distribution. The term ‘white’ is
used here to indicate that the expectation is zero and the autocorrelation
is governed by the Kronecker delta function:

$$\begin{aligned}
E[w(i)] &= 0 \\
E[w(i)w^T(j)] &= C_w(i)\,\delta(i,j)
\end{aligned} \qquad (4.12)$$

$C_w(i)$ is the covariance matrix of w(i). Since w(i) is supposed to have
a normal distribution with zero mean, $C_w(i)$ defines the density of w(i)
in full.

3 Sometimes the process noise is represented by G(i)w(i) where G(i) is the noise gain matrix.
With that, w(i) is not restricted to have dimension M. Of course, the dimension of G(i) must be
appropriate.

The initial condition of the state model is given in terms of the
expectation E[x(0)] and the covariance matrix $C_x(0)$. In order to find
out how these parameters of the process propagate to an arbitrary time i,
the state equation (4.11) must be used recursively:

$$\begin{aligned}
E[x(i+1)] &= F(i)E[x(i)] + L(i)u(i) \\
C_x(i+1) &= F(i)C_x(i)F^T(i) + C_w(i)
\end{aligned} \qquad (4.13)$$

The first equation follows from E[w(i)] = 0. The second equation uses
the fact that the process noise is white, i.e. $E[w(i)w^T(j)] = 0$ for $i \neq j$. See
(4.12).

If E[x(0)] and $C_x(0)$ are known, then equation (4.13) can be used to
calculate E[x(1)] and $C_x(1)$. From that, by reapplying (4.13), the next
values, E[x(2)] and $C_x(2)$, can be found, and so on. Thus, the iterative
use of equation (4.13) gives us E[x(i)] and $C_x(i)$ for arbitrary i > 0.

In the special case where neither F(i) nor $C_w(i)$ depends on i, the state
space model is time invariant. The notation can be shortened then by
dropping the index, i.e. F and $C_w$. If, in addition, F is stable (the
magnitudes of the eigenvalues of F are all less than one; Appendix
D.3.2), the sequence $C_x(i)$, i = 0, 1, ... converges to a constant matrix.
The balance in (4.13) is reached when the decrease of $C_x(i)$ due to F
compensates the increase due to $C_w$. If such is the case, then:


$$C_x = F C_x F^T + C_w \qquad (4.14)$$


This is the discrete Lyapunov equation.
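
One simple (though not the most efficient) way to solve (4.14) for a stable, time invariant model is to iterate the covariance recursion of (4.13) until it stops changing. The Python sketch below does this with plain nested lists. The matrices `F` and `Cw` are hypothetical examples, and the helper names are introduced here for illustration only.

```python
def matmul(A, B):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def propagate(mean, Cx, F, Cw, Lu=None):
    """One step of (4.13); mean and Lu are column vectors (lists of 1-element rows)."""
    new_mean = matmul(F, mean)
    if Lu is not None:
        new_mean = add(new_mean, Lu)
    new_Cx = add(matmul(matmul(F, Cx), transpose(F)), Cw)
    return new_mean, new_Cx


def steady_state_covariance(F, Cw, tol=1e-12, max_iter=100000):
    """Iterate the covariance part of (4.13) from Cx = 0 until it satisfies (4.14)."""
    Cx = [[0.0] * len(F) for _ in F]
    for _ in range(max_iter):
        nxt = add(matmul(matmul(F, Cx), transpose(F)), Cw)
        change = max(abs(a - b) for ra, rb in zip(nxt, Cx) for a, b in zip(ra, rb))
        Cx = nxt
        if change < tol:
            return Cx
    raise RuntimeError("no convergence; F is probably not stable")


# Hypothetical stable second order system (eigenvalues 0.9 and 0.8).
F = [[0.9, 0.1],
     [0.0, 0.8]]
Cw = [[0.01, 0.0],
      [0.0, 0.04]]
Cx_inf = steady_state_covariance(F, Cw)
```

In this example the second state is a first order process on its own, so its steady state variance should be close to 0.04/(1 - 0.8^2).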
Some special state space models


In this section, we introduce some elementary random processes. They 
are presented here not only to illustrate the properties of state models 
with random inputs, but also because they are often used as building 
blocks for models of complicated physical processes. 


Random constants Sometimes it is useful to model static parameters 
as states in a dynamic system. In that case, the states do not change in 
time: 


x(i + 1) =x(i) (4.15) 


Such a model is useful when the sequential measurements z(i) of x are 
processed online so that the estimate of x becomes increasingly accurate 
as time proceeds. 


First order autoregressive models A first order autoregressive (AR) 
model is of the type 


$$x(i+1) = \alpha x(i) + w(i) \qquad (4.16)$$


where w(i) is a white, zero mean, normally distributed sequence
with variance $\sigma_w^2$. In this particular example, $\sigma_x^2(i) \equiv C_x(i)$ since x(i)
is a scalar. $\sigma_x^2(i)$ can be expressed in closed form:
$\sigma_x^2(i) = \alpha^{2i}\sigma_x^2(0) + (1 - \alpha^{2i})\,\sigma_w^2/(1 - \alpha^2)$.
The equation holds if $|\alpha| \neq 1$. The system is
stable provided that $|\alpha| < 1$. In that case, the term $\alpha^{2i}\sigma_x^2(0)$ exponentially
fades out. The second term asymptotically reaches the steady state, i.e.
the solution of the Lyapunov equation:


$$\sigma_x^2(\infty) = \frac{1}{1 - \alpha^2}\,\sigma_w^2 \qquad (4.17)$$


If $|\alpha| > 1$, the system is not stable, and both terms grow exponentially.

First order AR models are used to describe slowly fluctuating
phenomena. Physically, such phenomena occur when broadband noise is
dampened by a first order system, e.g. mechanical shocks damped by a
mass/dampener system. Processes that involve exponential growth are
also modelled by first order AR models. Figure 4.3 shows a realization of a
stable and an unstable AR process.
Figure 4.3 First order autoregressive models. (a) Schematic diagram of the model.
(b) A stable realization. (c) An unstable realization
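
For the first order AR model, the covariance recursion of (4.13) reduces to the scalar recursion sigma_x^2(i+1) = alpha^2 sigma_x^2(i) + sigma_w^2. The Python sketch below compares this recursion with the closed form solution obtained by summing the geometric series, and with the steady state value of (4.17). The function names and the numerical values are illustrative assumptions.

```python
def ar1_variance_recursive(alpha, var_w, var_x0, i):
    """Apply var(i+1) = alpha^2 var(i) + var_w (scalar case of (4.13)) i times."""
    var = var_x0
    for _ in range(i):
        var = alpha * alpha * var + var_w
    return var


def ar1_variance_closed_form(alpha, var_w, var_x0, i):
    """Closed form of the same recursion; requires |alpha| != 1."""
    a2i = alpha ** (2 * i)
    return a2i * var_x0 + (1.0 - a2i) * var_w / (1.0 - alpha * alpha)


def ar1_steady_state_variance(alpha, var_w):
    """Steady state variance (4.17); only exists for a stable system."""
    if abs(alpha) >= 1.0:
        raise ValueError("no steady state: |alpha| must be smaller than 1")
    return var_w / (1.0 - alpha * alpha)


# Hypothetical parameters: a stable process started with a large variance.
alpha, var_w, var_x0 = 0.95, 1.0, 4.0
after_50 = ar1_variance_recursive(alpha, var_w, var_x0, 50)
limit = ar1_steady_state_variance(alpha, var_w)
```

With these numbers the variance after 50 steps is already close to the limit, since the transient decays as alpha^(2i).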


Random walk Consider the process:

$$x(i+1) = x(i) + w(i) \quad\text{with}\quad x(0) = 0 \qquad (4.18)$$

w(i) is a random sequence of independent increments, +d, and
decrements, -d; each occurs with probability 1/2. Suppose that after i time
steps, the number of increments is n(i), then the number of decrements is
i - n(i). Thus, x(i)/d = 2n(i) - i. The variable n(i) has a binomial
distribution (Appendix C.1.3) with parameters (i, 1/2). Its mean value is i/2;
hence E[x(i)] = 0. The variance of n(i) is i/4. Therefore, $\sigma_x^2(i) = i d^2$.
Clearly, $\sigma_x^2(i)$ is not limited, and the solution of the Lyapunov equation
does not exist.

According to the central limit theorem (Appendix C.1.4), after about 
20 time steps the distribution of x(i) is reasonably well approximated by 
a normal distribution. Figure 4.4 shows a realization of a random walk 
process. Random walk processes find application in navigation 
problems. 
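
The mean and variance of the random walk can be checked without simulation, because the distribution of x(i) follows exactly from the binomial distribution of the number of increments n(i). The Python sketch below evaluates this distribution. The step size and the number of steps are arbitrary illustrative choices.

```python
from math import comb


def random_walk_pmf(i, d=1.0):
    """Exact distribution of x(i): position d*(2n - i) has probability C(i, n) / 2^i."""
    return {d * (2 * n - i): comb(i, n) * 0.5 ** i for n in range(i + 1)}


def mean_and_variance(pmf):
    mean = sum(x * p for x, p in pmf.items())
    var = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return mean, var


# After i = 20 steps of size d = 0.5 the variance should be i * d^2.
steps, d = 20, 0.5
mean, var = mean_and_variance(random_walk_pmf(steps, d))
```

The variance grows linearly with the number of steps, so no steady state exists.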


Figure 4.4 Random walk


Second order autoregressive models Second order autoregressive
models are of the type:


$$x(i+1) = \alpha x(i) + \beta x(i-1) + w(i) \qquad (4.19)$$


The model can be cast into a state space model by defining
$\mathbf{x}(i) \stackrel{def}{=} [\,x(i)\ \ x(i-1)\,]^T$:


$$\mathbf{x}(i+1) = \begin{bmatrix} x(i+1) \\ x(i) \end{bmatrix} = \begin{bmatrix} \alpha & \beta \\ 1 & 0 \end{bmatrix} \begin{bmatrix} x(i) \\ x(i-1) \end{bmatrix} + \begin{bmatrix} w(i) \\ 0 \end{bmatrix} \qquad (4.20)$$


The eigenvalues of this system are $\lambda = \tfrac{1}{2}\alpha \pm \tfrac{1}{2}\sqrt{\alpha^2 + 4\beta}$. If $\alpha^2 > -4\beta$, the
system can be regarded as a cascade of two first order AR processes with
two real eigenvalues. However, if $\alpha^2 < -4\beta$, the eigenvalues become
complex and can be written as $d\,e^{\pm 2\pi j f}$ with $j = \sqrt{-1}$. The magnitude of
the eigenvalues, i.e. the damping, is $d = \sqrt{-\beta}$. The frequency f is found
by the relation $\cos 2\pi f = |\alpha|/(2d)$. The solution of the Lyapunov
equation is obtained by multiplying (4.19) on both sides by x(i+1), x(i) and
x(i-1), and taking the expectation:











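$$\begin{aligned}
E[x(i+1)x(i+1)] &= E[\alpha x(i)x(i+1) + \beta x(i-1)x(i+1) + w(i)x(i+1)] \\
E[x(i+1)x(i)] &= E[\alpha x(i)x(i) + \beta x(i-1)x(i) + w(i)x(i)] \\
E[x(i+1)x(i-1)] &= E[\alpha x(i)x(i-1) + \beta x(i-1)x(i-1) + w(i)x(i-1)]
\end{aligned}$$

Since $E[w(i)x(i+1)] = \sigma_w^2$ and $E[w(i)x(i)] = E[w(i)x(i-1)] = 0$, this gives:

$$\begin{aligned}
\sigma_x^2 &= \alpha\sigma_x^2 r_1 + \beta\sigma_x^2 r_2 + \sigma_w^2 \\
\sigma_x^2 r_1 &= \alpha\sigma_x^2 + \beta\sigma_x^2 r_1 \\
\sigma_x^2 r_2 &= \alpha\sigma_x^2 r_1 + \beta\sigma_x^2
\end{aligned}
\quad\Rightarrow\quad
\begin{aligned}
r_1 &= \frac{\alpha}{1 - \beta} \\
r_2 &= \frac{\alpha^2 + \beta - \beta^2}{1 - \beta} \\
\sigma_x^2 &= \frac{\sigma_w^2}{1 - \alpha r_1 - \beta r_2}
\end{aligned} \qquad (4.21)$$

The equations are valid if the system is in the steady state, i.e.
when $\sigma_x^2(i) = \sigma_x^2(i+1)$ and E[x(i+1)x(i)] = E[x(i)x(i-1)]. For this
situation the abbreviated notation $\sigma_x^2 \equiv \sigma_x^2(\infty)$ is used. Furthermore,
$r_k$ denotes the autocorrelation between x(i) and x(i+k). That is,
$E[x(i)x(i+k)] = \mathrm{Cov}[x(i)x(i+k)] = \sigma_x^2 r_k$ (only valid in the steady state).
See also Section 8.1.5 and Appendix C.2.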

Second order AR models are the time-discrete counterparts of second 
order differential equations describing physical processes that behave 
like a damped oscillator, e.g. a mass/spring/dampener system, a swinging 
pendulum, an electrical LCR-circuit, and so on. Figure 4.5 shows a 
realization of a second order AR process. 
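
The quantities that characterize a second order AR model can be computed directly from alpha and beta. The Python sketch below evaluates the eigenvalues, the damping d and frequency f (complex case only), and the steady state variance. The latter follows from solving the three expectation equations in the steady state, using E[w(i)x(i+1)] = sigma_w^2 and E[w(i)x(i)] = E[w(i)x(i-1)] = 0. The function names and the parameter values are illustrative assumptions.

```python
import cmath
import math


def ar2_eigenvalues(alpha, beta):
    """Eigenvalues of the system matrix [[alpha, beta], [1, 0]]."""
    root = cmath.sqrt(alpha * alpha + 4.0 * beta)
    return (alpha + root) / 2.0, (alpha - root) / 2.0


def ar2_damping_and_frequency(alpha, beta):
    """Damping d and frequency f; only defined for complex eigenvalues."""
    if alpha * alpha >= -4.0 * beta:
        raise ValueError("eigenvalues are real: alpha^2 >= -4 beta")
    d = math.sqrt(-beta)
    f = math.acos(abs(alpha) / (2.0 * d)) / (2.0 * math.pi)
    return d, f


def ar2_steady_state(alpha, beta, var_w):
    """Steady state variance and autocorrelations r1, r2 (stable system assumed)."""
    r1 = alpha / (1.0 - beta)
    r2 = alpha * r1 + beta
    var_x = var_w / (1.0 - alpha * r1 - beta * r2)
    return var_x, r1, r2


# Hypothetical damped oscillator: alpha^2 = 2.56 < -4 beta = 3.2.
alpha, beta, var_w = 1.6, -0.8, 1.0
lam1, lam2 = ar2_eigenvalues(alpha, beta)
d, f = ar2_damping_and_frequency(alpha, beta)
var_x, r1, r2 = ar2_steady_state(alpha, beta, var_w)
```

Here |lam1| equals d, and 1/f is the oscillation period expressed in samples.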


Prediction 


Equation (4.13) is the basis for prediction. Suppose that at time i an
unbiased estimate $\hat{x}(i)$ is known together with the associated error
covariance $C_e(i)$. The best predicted value (MMSE) of the state for $\ell$
samples ahead of i is obtained by the recursive application of (4.13).
The recursion starts with $E[x(i)] = \hat{x}(i)$ and terminates when
$E[x(i+\ell)]$ is obtained. The covariance matrix $C_e(i)$ is a measure of
the magnitudes of the random fluctuations of x(i) around $\hat{x}(i)$. As such
it is also a measure of uncertainty. Therefore, the recursive usage of
(4.13) applied to $C_e(i)$ gives $C_e(i+\ell)$, i.e. the uncertainty of the
prediction. With that, the recursive equations for the prediction
become:

$$\begin{aligned}
\hat{x}(i+\ell+1) &= F(i+\ell)\,\hat{x}(i+\ell) + L(i+\ell)\,u(i+\ell) \\
C_e(i+\ell+1) &= F(i+\ell)\,C_e(i+\ell)\,F^T(i+\ell) + C_w(i+\ell)
\end{aligned} \qquad (4.22)$$
Figure 4.5 Second order autoregressive process


Example 4.2 Prediction of a swinging pendulum

The mechanical system shown in Figure 4.6 is a pendulum whose
position is described by the angle $\theta(t)$ and the position of the hinge. The
length R of the arm is constant. The mass m is concentrated at the end.
The hinge moves randomly in the horizontal direction with an
acceleration given by a(t). Newton’s law, applied to the geometrical set up, gives:


$$m\,a(t)\cos\theta(t) + mR\,\ddot{\theta}(t) = -mg\sin\theta(t) - \frac{mk}{R}\,\dot{\theta}(t) \qquad (4.23)$$


k is a viscous friction constant; g is the gravitation constant. If the
sampling period $\Delta$ and $\max(|\theta|)$ are sufficiently small, the equation can
be transformed to a second order AR process. The following state model,
with $x_1(i) = \theta(i\Delta)$ and $x_2(i) = \dot{\theta}(i\Delta)$, is equivalent to that AR process:


$$\begin{aligned}
x_1(i+1) &= x_1(i) + \Delta\, x_2(i) \\
x_2(i+1) &= x_2(i) - \frac{\Delta}{R}\left( g\,x_1(i) + \frac{k}{R}\,x_2(i) + a(i) \right)
\end{aligned} \qquad (4.24)$$


Figure 4.7(a) shows the result of a so-called fixed interval prediction.
The prediction is performed from a fixed point in time (i is fixed), and
with a running lead, that is $\ell = 1, 2, 3, \ldots$. In Figure 4.7(a), the fixed
point is $i\Delta \approx 10$ (s). Assuming that for that i the state is fully known,
$\hat{x}(i) = x(i)$ and $C_e(i) = 0$, predictions for the next states are calculated
and plotted. It can be seen that the prediction error increases with the
lead. For larger leads, the prediction covariance matrix approaches
the state covariance matrix, i.e. $C_e(\infty) = C_x(\infty)$.

Figure 4.7(b) shows the results from fixed lead prediction. Here, the
recursions are reinitiated for each i. The lead is fixed and chosen such
that the relative prediction error is 36%.


Figure 4.6 A swinging pendulum (parameters: R = 1.5 m, g = 9.8 m/s², k = 0.2 m²/s, $\Delta$ = 0.01 s, $\sigma_a$ = 1 m/s²)

Figure 4.7 Prediction. (a) Fixed interval prediction. (b) Fixed lead prediction
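
The fixed interval prediction of the pendulum can be sketched in Python by applying (4.22) to the state model (4.24) with the parameters listed for Figure 4.6. Two assumptions are made here that the text does not state explicitly: the hinge acceleration a(i) is treated as a white sequence with standard deviation 1 m/s², and the initial state `x0` is a hypothetical value. There is no deterministic input in this model.

```python
# Parameters as listed with Figure 4.6; sigma_a is interpreted as the
# standard deviation of the hinge acceleration (an assumption).
R, g, k, delta, sigma_a = 1.5, 9.8, 0.2, 0.01, 1.0

# System matrix of the state model (4.24).
F = [[1.0, delta],
     [-delta * g / R, 1.0 - delta * k / R ** 2]]
# The noise enters x2 only, through -(delta/R) a(i). Treating a(i) as white
# with variance sigma_a^2 is an assumption of this sketch.
Cw = [[0.0, 0.0],
      [0.0, (delta * sigma_a / R) ** 2]]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def predict(x_hat, Ce, lead):
    """Apply (4.22) `lead` times; returns a list of (x_hat, Ce), one per lead."""
    out = []
    for _ in range(lead):
        x_hat = matmul(F, x_hat)
        Ce = add(matmul(matmul(F, Ce), transpose(F)), Cw)
        out.append((x_hat, Ce))
    return out


# Fixed interval prediction from a fully known state, i.e. Ce(i) = 0.
x0 = [[0.1], [0.0]]                    # hypothetical angle (rad) and angular speed (rad/s)
Ce0 = [[0.0, 0.0], [0.0, 0.0]]
predictions = predict(x0, Ce0, lead=200)
angle_std = [Ce[0][0] ** 0.5 for _, Ce in predictions]
```

The list `angle_std` holds the standard deviation of the predicted angle for each lead. It starts at zero and does not decrease, which reflects that the prediction uncertainty grows with the lead.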


Linear-Gaussian measurement models 


A linear measurement model takes the following form: 
z(i) = H(i)x(i) + v(i) (4.25) 


H(i) is the so-called measurement matrix. It is an N x M matrix. v(i) is 
the measurement noise. It is a sequence of random vectors of dimension 
N. Obviously, the measurement noise represents the noise sources in the 
sensory system. Examples are thermal noise in a sensor and the quant- 
ization errors of an AD converter. 

The general assumption is that the measurement noise is a zero mean, 
white random sequence with normal distribution. In addition, the 
sequence is supposed to have no correlation between the measurement 
noise and the process noise: 


$$\begin{aligned}
E[v(i)] &= 0 \\
E[v(i)v^T(j)] &= C_v(i)\,\delta(i,j) \\
E[v(i)w^T(j)] &= 0
\end{aligned} \qquad (4.26)$$


$C_v(i)$ is the covariance matrix of v(i). $C_v(i)$ specifies the density of v(i)
in full.
The sensory system is time invariant if neither H nor $C_v$ depends on i.


The discrete Kalman filter


The concepts developed in the previous section are sufficient to
transform the general scheme presented in Section 4.1 into a practical
solution. In order to develop the estimator, first the initial condition valid for
i = 0 must be established. In the general case, this condition is defined in
terms of the probability density p(x(0)) for x(0). Assuming a normal
distribution for x(0) it suffices to specify only the expectation E[x(0)]
and the covariance matrix $C_x(0)$. Hence, the assumption is that these
parameters are available. If not, we can set E[x(0)] = 0 and let $C_x(0)$
approach infinity, i.e. $C_x(0) \to \infty I$. Such a large covariance matrix
represents the lack of prior knowledge.


The next step is to establish the posterior density p(x(0)|z(0)) from
which the optimal estimate for x(0) follows. At this point, we enter the
loop of Figure 4.2. Hence, we calculate the density p(x(1)|z(0)) of the
next state, and process the measurement z(1), resulting in the updated
density p(x(1)|z(0), z(1)) = p(x(1)|Z(1)). From that, the optimal estimate
for x(1) follows. This procedure has to be iterated for all the next time
cycles.

The representation of all the densities that are involved can be given
in terms of expectations and covariances. The reason is that any linear
combination of Gaussian random vectors yields a vector that is also
Gaussian. Therefore, both p(x(i+1)|Z(i)) and p(x(i)|Z(i)) are fully
represented by their expectations and covariances. In order to
discriminate between the two situations a new notation is needed. From
now on, the conditional expectation E[x(i)|Z(j)] will be denoted by
$\bar{x}(i|j)$. It is the expectation associated with the conditional density
p(x(i)|Z(j)). The covariance matrix associated with this density is
denoted by C(i|j).

The update, i.e. the determination of p(x(i)|Z(i)) given
p(x(i)|Z(i-1)), follows from Section 3.1.5 where it has been shown that the
unbiased linear MMSE estimate in the linear-Gaussian case equals the
MMSE estimate, and that this estimate is the conditional expectation.
Application of (3.33) and (3.45) to (4.25) and (4.26) gives:


$$\begin{aligned}
\hat{z}(i) &= H(i)\,\bar{x}(i|i-1) && \text{(predicted measurement)} \\
S(i) &= H(i)\,C(i|i-1)\,H^T(i) + C_v(i) && \text{(innovation matrix)} \\
K(i) &= C(i|i-1)\,H^T(i)\,S^{-1}(i) && \text{(Kalman gain matrix)} \\
\bar{x}(i|i) &= \bar{x}(i|i-1) + K(i)\,(z(i) - \hat{z}(i)) && \text{(updated estimate)} \\
C(i|i) &= C(i|i-1) - K(i)\,S(i)\,K^T(i) && \text{(error covariance matrix)}
\end{aligned} \qquad (4.27)$$

The interpretation is as follows. $\hat{z}(i)$ is the predicted measurement. It is an
unbiased estimate of z(i) using all information from the past. The
so-called innovation matrix S(i) represents the uncertainty of the predicted
measurement. The uncertainty is due to two factors: the uncertainty of
x(i) as expressed by C(i|i-1), and the uncertainty due to the
measurement noise v(i) as expressed by $C_v(i)$. The matrix K(i) is the Kalman gain
matrix. This matrix is large when S(i) is small and $C(i|i-1)H^T(i)$ is
large, that is, when the measurements are relatively accurate. When this is
the case, the values in the error covariance matrix C(i|i) will be much
smaller than C(i|i-1).

The prediction, i.e. the determination of p(x(i+1)|Z(i)) given
p(x(i)|Z(i)), boils down to finding out how the expectation x̄(i|i) and
the covariance matrix C(i|i) propagate to the next state. Using (4.11) and
(4.13) we have:

$$
\begin{aligned}
\bar{x}(i+1|i) &= F(i)\,\bar{x}(i|i) + L(i)\,u(i) \\
C(i+1|i) &= F(i)\,C(i|i)\,F^T(i) + C_w(i)
\end{aligned}
\tag{4.28}
$$

At this point, we increment the counter, and x̄(i+1|i) and C(i+1|i)
become x̄(i|i−1) and C(i|i−1). These recursive equations are generally
referred to as the discrete Kalman filter (DKF).
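
The update (4.27) and the prediction (4.28) can be sketched for the simplest possible case, a scalar state with a scalar measurement. The function names `kf_update` and `kf_predict`, the parameter names and the numbers are illustrative assumptions, not taken from the text; the scalar restriction is chosen only to keep the sketch free of matrix libraries.

```python
def kf_update(x_pred, c_pred, z, h, c_v):
    """Update step (4.27) for a scalar state and a scalar measurement."""
    z_pred = h * x_pred                 # predicted measurement
    s = h * c_pred * h + c_v            # innovation variance S(i)
    k = c_pred * h / s                  # Kalman gain K(i)
    x_upd = x_pred + k * (z - z_pred)   # updated estimate
    c_upd = c_pred - k * s * k          # error variance C(i|i)
    return x_upd, c_upd, k


def kf_predict(x_upd, c_upd, f, c_w, ctrl_gain=0.0, u=0.0):
    """Prediction step (4.28) for a scalar state."""
    x_pred = f * x_upd + ctrl_gain * u
    c_pred = f * c_upd * f + c_w
    return x_pred, c_pred


# One full cycle with made-up numbers.
x_upd, c_upd, k = kf_update(x_pred=0.0, c_pred=4.0, z=1.0, h=1.0, c_v=1.0)
x_next, c_next = kf_predict(x_upd, c_upd, f=0.9, c_w=0.1)
print(k, x_upd, c_upd, x_next, c_next)
```

With an uncertain prediction (variance 4) and a more accurate measurement (variance 1) the gain is close to one, so the update moves the estimate most of the way towards the measurement and the error variance drops below the measurement variance.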

In the Gaussian case, it does not matter much which optimality 
criterion we select. MMSE estimation, MMAE estimation and MAP 
estimation yield the same result, i.e. the conditional mean. Hence, the 
final estimate is found as x̂(i) = x̄(i|i). It is an absolute unbiased estimate,
and its covariance matrix is C(i|i). Therefore, this matrix is often called 
the error covariance matrix. 

In the time invariant case, and assuming that the Kalman filter is 
stable, the error covariance matrix converges to a constant matrix. In 
that case, the innovation matrix and the Kalman gain matrix become 
constant as well. The filter is said to be in the steady state. The steady 
state condition simply implies that C(i+ 1|i) = C(i|i — 1). If the notation 
for this matrix is shortened to P, then (4.28) and (4.27) lead to the 
following equation: 


$$
P = F P F^T + C_w - F P H^T \left( H P H^T + C_v \right)^{-1} H P F^T
\tag{4.29}
$$

The equation is known as the discrete algebraic Riccati equation.
Usually, it is solved numerically. Its solution implicitly defines the
steady state solution for S, K and C.
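
One simple way to solve (4.29) numerically is to iterate its right-hand side until P no longer changes, which is what the filter itself does when it runs into the steady state. The sketch below does this for a scalar system; the function name `riccati_steady_state` and the parameter values are assumptions made for illustration, and dedicated solvers are normally preferred for matrix-valued problems.

```python
def riccati_steady_state(f, h, c_w, c_v, p0=1.0, tol=1e-12, max_iter=10000):
    """Fixed-point iteration on the scalar form of (4.29)."""
    p = p0
    for _ in range(max_iter):
        p_new = f * p * f + c_w - (f * p * h) ** 2 / (h * p * h + c_v)
        if abs(p_new - p) < tol:
            return p_new
        p = p_new
    raise RuntimeError("Riccati iteration did not converge")


# Made-up time invariant scalar system.
f, h, c_w, c_v = 0.9, 1.0, 0.1, 1.0
p = riccati_steady_state(f, h, c_w, c_v)
s = h * p * h + c_v      # steady state innovation variance S
k = p * h / s            # steady state Kalman gain K
c = p - k * s * k        # steady state error variance C
print(p, s, k, c)
```

The last three lines show how S, K and C follow from P by substituting it for C(i|i−1) in (4.27).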
Example 4.3 Application to the swinging pendulum

In this example we reconsider the mechanical system shown in
Figure 4.6, and described in Example 4.2. Suppose a gyroscope measures
the angular speed θ̇(t) at regular intervals of 0.4 s. The discrete
model in Example 4.2 uses a sampling period of Δ = 0.01 s. We could
increase the sampling period to 0.4 s in order to match it with the
sampling period of the measurements, but then the applied discrete
approximation would be poor. Instead, we model the measurements
with a time variant model: z(i) = H(i)x(i) + v(i) where both H(i) and
C_v(i) are always zero except for those i that are multiples of 40:

$$
H(i) =
\begin{cases}
[\,0 \quad 1\,] & \text{if } \operatorname{mod}(i,40) = 0 \\
[\,0 \quad 0\,] & \text{elsewhere}
\end{cases}
\tag{4.30}
$$


The effect of such a measurement matrix is that during 39 consecutive
cycles of the loop only predictions take place. During these cycles
H(i) = 0, and consequently the Kalman gains are zero. The corresponding
updates would have no effect, and can be skipped. Only during each 40th
cycle H(i) ≠ 0, and a useful update takes place.
Figure 4.8 shows measurements obtained from the swinging pendulum.
The variance of the measurement noise is σ_v² = C_v(i) = 0.1² (rad/s)².
The filter is initiated with x̄(0) = 0 and with C_x(0) → ∞. The figure also
shows the second element of the Kalman gain matrix, i.e. K_{2,1}(i) (the
first one is much smaller, but follows the same pattern). It can be seen
that after about 3 s the Kalman gain (and the error covariance matrix)
remain constant. The filter has reached its steady state.
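
The loop structure of this example, a prediction in every cycle and an update only in every 40th cycle, can be sketched as follows. The scalar random-walk model, the function name `run_intermittent_filter` and all numbers are hypothetical stand-ins; the two-state pendulum model of Example 4.2 is not reproduced here.

```python
def run_intermittent_filter(measurements, period, f, c_w, h, c_v, x0, c0):
    """Predict every cycle; update only when i is a multiple of `period`.

    `measurements` maps a cycle index i to z(i) for the cycles that
    actually carry a measurement.
    """
    x_pred, c_pred = x0, c0
    log = []
    for i in range(max(measurements) + 1):
        if i % period == 0:
            s = h * c_pred * h + c_v
            k = c_pred * h / s
            x_upd = x_pred + k * (measurements[i] - h * x_pred)
            c_upd = c_pred - k * s * k
        else:
            # H(i) = 0: the update is skipped, which equals a zero gain.
            k = 0.0
            x_upd, c_upd = x_pred, c_pred
        log.append((i, k, c_upd))
        x_pred = f * x_upd
        c_pred = f * c_upd * f + c_w
    return log


z = {0: 0.9, 40: 1.1, 80: 1.0}   # made-up measurements at i = 0, 40, 80
log = run_intermittent_filter(z, period=40, f=1.0, c_w=0.01,
                              h=1.0, c_v=0.01, x0=0.0, c0=100.0)
for i, k, c_upd in log[38:42]:
    print(i, round(k, 4), round(c_upd, 4))
```

Between two measurements the error variance grows by the process noise variance in every cycle; each 40th cycle pulls it back down.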


4.2.2 Suboptimal solutions for nonlinear systems 


This section extends the discussion on state estimation to the more 
general case of nonlinear systems and nonlinear measurement functions: 


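$$
\begin{aligned}
x(i+1) &= f(x(i), u(i), i) + w(i) \\
z(i) &= h(x(i), i) + v(i)
\end{aligned}
\tag{4.31}
$$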


The vector f(·,·,·) is a nonlinear, time variant function of the state x(i)
and the control vector u(i). The control vector is a deterministic signal.
Since u(i) is fully known, it only causes an explicit time dependency in
f(·,·,·). Without loss of generality, the notation can be shortened to
f(x(i),i), because such an explicit time dependency is already implied in
that shorter notation. If no confusion can occur, the abbreviation f(x(i))
will be used instead of f(x(i),i) even if the system does depend on time.
As before, w(i) is the process noise. It is modelled as zero mean, Gaussian
white noise with covariance matrix C_w(i). The vector h(·,·) is a
nonlinear measurement function. Here too, if no confusion is possible, the
abbreviation h(x(i)) will be used covering both the time variant and the
time invariant case. v(i) represents the measurement noise, modelled as
zero mean, Gaussian white noise with covariance matrix C_v(i).

Any Gaussian random vector that undergoes a linear operation retains its 
Gaussian distribution. A linear operator only affects the expectation and the 
covariance matrix of that vector. This property is the basis of the Kalman 
filter. It is applicable to linear-Gaussian systems, and it permits a solution 
that is entirely expressed in terms of expectations and covariance matrices. 
However, the property does not hold for nonlinear operations. In nonlinear 
systems, the state vectors and the measurement vectors are not Gaussian 
distributed, even though the process noise and the measurement noise might 
be. Consequently, the expectation and the covariance matrix do not fully 
specify the probability density of the state vector. The question is then how 
to determine this non-Gaussian density, and how to represent it in an 
economical way. Unfortunately, no general answer exists to this question. 

This section seeks the answer by assuming that the nonlinearities of the
system are smooth enough to allow linear or quadratic approximations.
Using these approximations, Kalman-like filters come within reach.
These solutions are suboptimal since there is no guarantee that the
approximations are close.

An obvious way to get the approximations is by application of a
Taylor series expansion of the functions. Ignoring the higher order
terms of the expansion gives the desired approximation. The Taylor
series exists by virtue of the assumed smoothness of the nonlinear
function; it does not work out if the nonlinearity is a discontinuity,
e.g. saturation, dead zone, hysteresis, and so on. The Taylor series
expansions of the system equations are as follows:

$$
\begin{aligned}
f(x+\varepsilon) &= f(x) + F(x)\,\varepsilon
  + \frac{1}{2}\sum_{m=0}^{M-1} e_m\,\varepsilon^T F_{xx}^{m}(x)\,\varepsilon + HOT \\
h(x+\varepsilon) &= h(x) + H(x)\,\varepsilon
  + \frac{1}{2}\sum_{n=0}^{N-1} e_n\,\varepsilon^T H_{xx}^{n}(x)\,\varepsilon + HOT
\end{aligned}
\tag{4.32}
$$


e_m is the Cartesian basis vector with appropriate dimension. The m-th
element of e_m is one; the others are zeros. e_m can be used to select the
m-th element of a vector: e_m^T x = x_m. F(x) and H(x) are Jacobian
matrices. F_xx^m(x) and H_xx^n(x) are Hessian matrices. These matrices are
defined in Appendix B.4. HOT are the higher order terms. The quadratic
approximation arises if the higher order terms are ignored. If, in
addition, the quadratic term is ignored, the approximation becomes linear,
e.g. f(x + ε) ≅ f(x) + F(x)ε.
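
The linear approximation can be checked numerically. The sketch below approximates the Jacobian F(x) by central differences and compares f(x) + F(x)ε with f(x + ε) for a small ε. The function names, the example function `f_example` and the step size are assumptions chosen for illustration; an analytical Jacobian is preferable whenever it is available.

```python
import math


def numerical_jacobian(func, x, eps=1e-6):
    """Central-difference approximation of the Jacobian of func at x."""
    fx = func(x)
    jac = [[0.0] * len(x) for _ in fx]
    for n in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[n] += eps
        x_minus[n] -= eps
        f_plus, f_minus = func(x_plus), func(x_minus)
        for m in range(len(fx)):
            jac[m][n] = (f_plus[m] - f_minus[m]) / (2.0 * eps)
    return jac


def linear_approx(func, x, e):
    """First order approximation f(x + e) ~ f(x) + F(x) e."""
    fx = func(x)
    jac = numerical_jacobian(func, x)
    return [fx[m] + sum(jac[m][n] * e[n] for n in range(len(x)))
            for m in range(len(fx))]


def f_example(x):
    """Hypothetical smooth two-state system function."""
    return [x[0] + 0.1 * x[1], x[1] - 0.1 * math.sin(x[0])]


x = [0.3, -0.2]
e = [0.01, 0.02]
exact = f_example([x[0] + e[0], x[1] + e[1]])
approx = linear_approx(f_example, x, e)
print(exact, approx)
```

The remaining difference between `exact` and `approx` is of second order in ε, which is the quadratic term that the linear approximation ignores.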


The linearized Kalman filter 


The simplest approximation occurs when the system equations are
linearized around some fixed value x̄ of x(i). This approximation is
useful if the system is time invariant and stable, and if the states swing
around an equilibrium state. Such a state is the solution of:

$$
\bar{x} = f(\bar{x})
\tag{4.33}
$$

Defining ε(i) = x(i) − x̄, the linear approximation of the state
equation (4.31) becomes x(i+1) ≅ f(x̄) + F(x̄)ε(i) + w(i). After some
manipulations:

$$
\begin{aligned}
x(i+1) &\cong F(\bar{x})\,x(i) + \bigl(I - F(\bar{x})\bigr)\bar{x} + w(i) \\
z(i) &\cong h(\bar{x}) + H(\bar{x})\bigl(x(i) - \bar{x}\bigr) + v(i)
\end{aligned}
\tag{4.34}
$$

By interpreting the term (I − F(x̄))x̄ as a constant control input, and
by compensating the offset term h(x̄) − H(x̄)x̄ in the measurement
vector, these equations become equivalent to (4.11) and (4.25). This
allows the direct application of the DKF as given in (4.28) and (4.27).

Many practical implementations of the discrete Kalman filter are 
inherently linearized Kalman filters because physical processes are sel- 
dom exactly linear, and often a linear model is only an approximation of 
the real process. 


Example 4.4 The swinging pendulum
The swinging pendulum from Example 4.2 is described by (4.23):

$$
m\,a(t)\cos\theta(t) + m R\,\ddot{\theta}(t) = -m g \sin\theta(t) - \frac{k}{R}\,\dot{\theta}(t)
$$

This equation is transformed into the linear model given in (4.24).
In fact, this is a linearized model derived from:

$$
\begin{aligned}
x_1(i+1) &= x_1(i) + \Delta\,x_2(i) \\
x_2(i+1) &= x_2(i) - \frac{\Delta}{R}\left( g \sin x_1(i) + \frac{k}{R m}\,x_2(i) + a(i)\cos x_1(i) \right)
\end{aligned}
\tag{4.35}
$$

The equilibrium for a(i) = 0 is x̄ = 0. The linearized model is
obtained by equating sin x_1(i) ≅ x_1(i) and cos x_1(i) ≅ 1.


Example 4.5 A linearized model for volume density estimation 

In Section 4.1.1 we introduced the nonlinear, non-Gaussian problem 
of the volume density estimation of a mix in the process industry 
(Example 4.1). The model included a discrete state variable to 
describe the on/off regulation of the input flow. We will now replace 
this model by a linear feedback mechanism: 





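$$
\begin{aligned}
V(i+1) &\cong V(i) + \Delta\left( \alpha\bigl(V_0 - V(i)\bigr) + w_1(i) + \bar{f}_2 + w_2(i) - c\sqrt{V(i)/V_{ref}} \right) \\
D(i+1) &\cong D(i) - \Delta\,\frac{\bigl(\alpha(V_0 - V(i)) + w_1(i)\bigr) D(i) - \bigl(\bar{f}_2 + w_2(i)\bigr)\bigl(1 - D(i)\bigr)}{V(i)}
\end{aligned}
\tag{4.36}
$$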
The liquid input flow has now been modelled by f_1(i) = α(V_0 − V(i)) +
w_1(i). The constant V_0 = V_ref + (c − f̄_2)/α (with V_ref =
½(V_high + V_low)) realizes the correct mean value of V(i). The random part
w_1(i) of f_1(i) establishes a first order AR model of V(i) which is used
as a rough approximation of the randomness of f_1(i). The substance input
flow f_2(i), which in Example 4.1 appears as chunks at some discrete points
of time, is now modelled by f̄_2 + w_2(i), i.e. a continuous flow with some
randomness.

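The equilibrium is found as the solution of V(i+1) = V(i) and
D(i+1) = D(i). The results are:

$$
\bar{V} = V_{ref}
\qquad
\bar{D} = \frac{\bar{f}_2}{\bar{f}_2 + \alpha\,(V_0 - V_{ref})} = \frac{\bar{f}_2}{c}
\tag{4.37}
$$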


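The expressions for the Jacobian matrices are:

$$
F(x) = \frac{\partial f(x)}{\partial x} =
\begin{bmatrix}
1 - \Delta\left( \alpha + \dfrac{c}{2\sqrt{V\,V_{ref}}} \right) & 0 \\[2ex]
\Delta\left( \dfrac{\alpha D}{V} - \dfrac{\bar{f}_2 (1 - D) - \alpha (V_0 - V) D}{V^2} \right) &
1 - \Delta\,\dfrac{\bar{f}_2 + \alpha (V_0 - V)}{V}
\end{bmatrix}
\tag{4.38}
$$

$$
G(x) = \frac{\partial f(x)}{\partial w} =
\begin{bmatrix}
\Delta & \Delta \\[1ex]
-\Delta\,\dfrac{D}{V} & \Delta\,\dfrac{1 - D}{V}
\end{bmatrix}
\tag{4.39}
$$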
V V 








The considered measurement system consists of two sensors: 


• A level sensor that measures the volume V(i) of the barrel.
• A radiation sensor that measures the density D(i) of the output flow.


The latter uses the radiation of some source, e.g. X-rays, that
is absorbed by the fluid. According to Beer-Lambert's law, P_out =
P_in exp(−μD(i)) where μ is a constant depending on the path length
of the ray and on the material. Using an optical detector the
measurement function becomes z = U exp(−μD) + v with U a
constant voltage. With that, the model of the two sensors becomes:

$$
z(i) = h(x(i)) + v(i) =
\begin{bmatrix}
V(i) + v_1(i) \\
U \exp\bigl(-\mu D(i)\bigr) + v_2(i)
\end{bmatrix}
\tag{4.40}
$$

with the Jacobian matrix:

$$
H(x) =
\begin{bmatrix}
1 & 0 \\
0 & -U \mu \exp(-\mu D)
\end{bmatrix}
\tag{4.41}
$$

The best fitted parameters of this model are as follows:

Volume control          Substance flow        Output flow          Measurement system
Δ = 1 (s)               f̄_2 = 0.1 (lit/s)     V_ref = 4000 (lit)   σ_v1 = 16 (lit)
V_0 = 4001 (lit)        σ_w2 = 0.9 (lit/s)    c = 1 (lit/s)        U = 1000 (V)
α = 0.95 (1/s)                                                     μ = 100
σ_w1 = 0.1225 (lit/s)                                              σ_v2 = 0.02 (V)


Figure 4.9 shows the real states (obtained from a simulation using the 
model from Example 4.1), observed measurements, estimated states 
and estimation errors. It can be seen that: 


• The density can only be estimated if the real density is close to the
equilibrium. In every other region, the linearization of the
measurement is not accurate enough.

• The estimator is able to estimate the mean volume, but cannot keep
track of the fluctuations. The estimation error of the volume is
much larger than indicated by the 1σ boundaries (obtained from
the error covariance matrix). The reason for the inconsistent
behaviour is that the linear-Gaussian AR model does not fit well enough.


Figure 4.9 Linearized Kalman filtering applied to the volume density estimation
problem (panels: measured volume (litre); real (thin) and estimated volume
(litre); volume error (litre); density)
The extended Kalman filter


A straightforward generalization of the linearized Kalman filter occurs
when the equilibrium point x̄ is replaced with a nominal trajectory
x̄(i), recursively defined as:

$$
\bar{x}(i+1) = f(\bar{x}(i)) \quad \text{with} \quad \bar{x}(0) = E[x(0)]
\tag{4.42}
$$


Although the approach is suitable for time variant systems, it is not often 
used. There is another approach with almost the same computational 
complexity, but with better performance. That approach is the extended 
Kalman filter (EKF). 

Again, the intention is to keep track of the conditional expectation
x̄(i|i) and the covariance matrix C(i|i). In the linear-Gaussian case, where
all distributions are Gaussian, the conditional mean is identical to
the MMSE estimate (= minimum variance estimate), the MMAE estimate,
and the MAP estimate; see Section 3.1.3. In the present case, the
distributions are not necessarily Gaussian, and the solutions of the three
estimators do not coincide. The extended Kalman filter provides only an
approximation of the MMSE estimate.

Each cycle of the extended Kalman filter consists of a ‘one step ahead’ 
prediction and an update, as before. However, the tasks are much more 
difficult now, because the calculation of, for instance, the ‘one step 
ahead’ expectation: 


$$
\bar{x}(i+1|i) = E[x(i+1)|Z(i)] = \int x(i+1)\,p(x(i+1)|Z(i))\,dx(i+1)
\tag{4.43}
$$


requires the probability density p(x(i+ 1)|Z(i)); see (4.8). But, as said 
before, it is not clear how to represent this density. The solution of the 
EKF is to apply a linear approximation of the system function. With 
that, the ‘one step ahead’ expectation can be expressed entirely in terms 
of the moments of p(x(i)|Z(i)). 

The EKF uses linear approximations of the system functions using the
first two terms of the Taylor series expansions.⁴ Suppose that at time i we
have the updated estimate x̂(i) = x̄(i|i) and the associated approximate
error covariance matrix C(i|i). The word ‘approximate’ expresses the fact
that our estimates are not guaranteed to be unbiased due to the linear
approximations. However, we do assume that the influence of the errors
induced by the linear approximations is small. If the estimation error is
denoted by e(i), then:

$$
\begin{aligned}
x(i) &= \bar{x}(i|i) - e(i) \\
x(i+1) &= f(x(i)) + w(i) \\
       &= f(\bar{x}(i|i) - e(i)) + w(i) \\
       &\cong f(\bar{x}(i|i)) - F(\bar{x}(i|i))\,e(i) + w(i)
\end{aligned}
\tag{4.44}
$$

⁴ With more terms in the Taylor series expansion, the approximation becomes more accurate.
For instance, the second order extended Kalman filter uses quadratic approximations based on
the first three terms of the Taylor series expansions. The discussion on the extensions of this
type is beyond the scope of this book. See Bar-Shalom (1993).


In these expressions, x̄(i|i) is our estimate.⁵ It is available at time i, and
as such, it is deterministic. Only e(i) and w(i) are random. Taking the
expectation on both sides of (4.44), we obtain approximate values for
the ‘one step ahead’ prediction:

$$
\begin{aligned}
\bar{x}(i+1|i) &\cong f(\bar{x}(i|i)) \\
C(i+1|i) &\cong F(\bar{x}(i|i))\,C(i|i)\,F^T(\bar{x}(i|i)) + C_w(i)
\end{aligned}
\tag{4.45}
$$


We have approximations instead of equalities for two reasons. First, 
we neglect the possible bias of x(i|i), i.e. a nonzero mean of e(i). 
Second, we ignore the higher order terms of the Taylor series 
expansion. 

Upon incrementing the counter, x̄(i+1|i) becomes x̄(i|i−1), and
we now have to update the prediction x̄(i|i−1) by using a new
measurement z(i) in order to get an approximation of the conditional
mean x̄(i|i). First we calculate the predicted measurement ẑ(i) based
on x̄(i|i−1) using a linear approximation of the measurement
function, that is h(x̂ − e) ≅ h(x̂) − H(x̂)e. Next, we calculate the
innovation matrix S(i) using that same approximation. Then, we
apply the update according to the same equations as in the
linear-Gaussian case.





⁵ Up to this point x̄(i|i) has been the exact expectation of x(i) given all measurements up to z(i).
From now on, in this section, x̄(i|i) will denote an approximation of that.

The predicted measurements are:

$$
\begin{aligned}
\hat{z}(i) &= E[z(i)\,|\,\bar{x}(i|i-1)] \\
 &= E[h(x(i)) + v(i)\,|\,\bar{x}(i|i-1)] \\
 &\cong E[h(\bar{x}(i|i-1)) - H(\bar{x}(i|i-1))\,e(i|i-1) + v(i)] \\
 &\cong h(\bar{x}(i|i-1))
\end{aligned}
\tag{4.46}
$$

The approximation is based on the assumption that E[e(i|i−1)] ≅ 0,
and on the Taylor series expansion of h(·). The innovation matrix
becomes:

$$
S(i) = H(\bar{x}(i|i-1))\,C(i|i-1)\,H^T(\bar{x}(i|i-1)) + C_v(i)
\tag{4.47}
$$


From this point on, the update continues as in the linear-Gaussian case; 
see (4.27): 


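$$
\begin{aligned}
K(i) &= C(i|i-1)\,H^T(\bar{x}(i|i-1))\,S^{-1}(i) \\
\bar{x}(i|i) &= \bar{x}(i|i-1) + K(i)\,\bigl(z(i) - \hat{z}(i)\bigr) \\
C(i|i) &= C(i|i-1) - K(i)\,S(i)\,K^T(i)
\end{aligned}
\tag{4.48}
$$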


Despite the similarity of the last equation with respect to the linear case,
there is an important difference. In the linear case, the Kalman gain K(i)
depends solely on deterministic parameters: H(i), F(i), C_w(i), C_v(i) and
C_x(0). It does not depend on the data. Therefore, K(i) is fully
deterministic. It could be calculated in advance instead of online. In the
EKF, the gains depend upon the estimated states x̄(i|i−1) through
H(x̄(i|i−1)), and thus also upon the measurements. As such, the Kalman
gains are random matrices. Two runs of the extended Kalman filter in two
repeated experiments lead to two different sequences of the Kalman
gains. In fact, this randomness of K(i) can cause unstable behaviour.
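
A minimal sketch of one EKF cycle, prediction (4.45) followed by the update of (4.46) to (4.48), is given below for a scalar state. The function name `ekf_cycle`, the hypothetical model (mildly nonlinear dynamics with an exponential sensor) and all numbers are assumptions for illustration only. The second half of the example runs two cycles with the same prior but different first measurements, to show that the gains become data dependent.

```python
import math


def ekf_cycle(x_upd, c_upd, z_next, f, f_jac, h, h_jac, c_w, c_v):
    """One EKF cycle for a scalar state: predict, then update."""
    # 'One step ahead' prediction, linearized around the current estimate.
    x_pred = f(x_upd)
    jac_f = f_jac(x_upd)
    c_pred = jac_f * c_upd * jac_f + c_w
    # Update, linearized around the prediction.
    z_pred = h(x_pred)
    jac_h = h_jac(x_pred)
    s = jac_h * c_pred * jac_h + c_v
    k = c_pred * jac_h / s
    x_new = x_pred + k * (z_next - z_pred)
    c_new = c_pred - k * s * k
    return x_new, c_new, k


def f(x):
    return x + 0.1 * math.sin(x)


def f_jac(x):
    return 1.0 + 0.1 * math.cos(x)


def h(x):
    return math.exp(-x)


def h_jac(x):
    return -math.exp(-x)


# Same prior, two different first measurements (made-up values).
xa, ca, _ = ekf_cycle(0.5, 0.2, 0.60, f, f_jac, h, h_jac, 0.01, 0.01)
xb, cb, _ = ekf_cycle(0.5, 0.2, 0.40, f, f_jac, h, h_jac, 0.01, 0.01)
# The gains of the next cycle now differ, although the model is the same.
_, _, ka = ekf_cycle(xa, ca, 0.60, f, f_jac, h, h_jac, 0.01, 0.01)
_, _, kb = ekf_cycle(xb, cb, 0.60, f, f_jac, h, h_jac, 0.01, 0.01)
print(ka, kb)
```

For linear f and h the Jacobians are constants and the function reduces to the discrete Kalman filter, in which case the gain sequence no longer depends on the measurements.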


Example 4.6 The extended Kalman filter for volume density 
estimation 

Application of the EKF to the density estimation problem introduced
in Example 4.1 and represented by a linear-Gaussian model in
Example 4.5 gives the results as shown in Figure 4.10. Compared
with the results of the linearized KF (Figure 4.9) the density errors are
now much more consistent with the 1σ boundaries obtained from the
error covariance matrix. However, the EKF is still not able to cope
with the non-Gaussian disturbances of the volume.

Note also that the 1σ boundaries do not reach a steady state. The
filter remains time variant, even in the long term.

Figure 4.10 Extended Kalman filtering for the volume density estimation problem


The iterated extended Kalman filter 


A further improvement of the update step in the extended Kalman filter
is within reach if the current estimate x̄(i|i) is used to get an improved
linear approximation of the measurement function yielding an improved
predicted measurement ẑ(i). In turn, such an improved predicted
measurement can improve the current estimate. This suggests an iterative
approach.

Let ẑ_ℓ(i) be the predicted measurement in the ℓ-th iteration, and let
x̂_ℓ(i) be the ℓ-th improvement of x̄(i|i). The iteration is initiated with
x̂_0(i) = x̄(i|i−1). A naive approach for the calculation of x̂_{ℓ+1}(i)
simply uses a relinearization of h(·) based on x̂_ℓ(i):

$$
\begin{aligned}
H_{\ell+1} &= H(\hat{x}_{\ell}(i)) \\
S_{\ell+1} &= H_{\ell+1}\,C(i|i-1)\,H_{\ell+1}^T + C_v(i) \\
K_{\ell+1} &= C(i|i-1)\,H_{\ell+1}^T\,S_{\ell+1}^{-1} \\
\hat{x}_{\ell+1}(i) &= \bar{x}(i|i-1) + K_{\ell+1}\bigl(z(i) - h(\bar{x}(i|i-1))\bigr)
\end{aligned}
\tag{4.49}
$$



Hopefully, the sequence x̂_ℓ(i), with ℓ = 0, 1, 2, ..., converges to a final
solution.

A better approach is the so-called iterated extended Kalman filter
(IEKF). Here, the approximation is made that both the predicted state
and the measurement noise are normally distributed. With that, the
posterior probability density (4.9)

$$
p(x(i)|Z(i)) \propto p(x(i)|Z(i-1))\,p(z(i)|x(i))
\tag{4.50}
$$


being the product of two Gaussians, is also a Gaussian. Thus, the MMSE
estimate coincides with the MAP estimate, and the task is now to find
the maximum of p(x(i)|Z(i)). Equivalently, we maximize its logarithm.
After the elimination of the irrelevant constants and factors, it all boils
down to minimizing the following function w.r.t. x:

$$
f(x) =
\underbrace{\tfrac{1}{2}\,(x - \bar{x}_p)^T C_p^{-1} (x - \bar{x}_p)}_{\text{comes from } p(x(i)|Z(i-1))}
+ \underbrace{\tfrac{1}{2}\,(z - h(x))^T C_v^{-1} (z - h(x))}_{\text{comes from } p(z(i)|x(i))}
\tag{4.51}
$$

For brevity, the following notation has been used:

$$
\begin{aligned}
\bar{x}_p &= \bar{x}(i|i-1) \\
C_p &= C(i|i-1) \\
z &= z(i) \\
C_v &= C_v(i)
\end{aligned}
$$


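The strategy to find the minimum is to use Newton-Raphson iteration
starting from x̂_0 = x̄(i|i−1). In the ℓ-th iteration step, we have already
an estimate x̂_{ℓ−1} obtained from the previous step. We expand f(x) in a
second order Taylor series approximation:

$$
f(x) \cong f(\hat{x}_{\ell-1})
 + (x - \hat{x}_{\ell-1})^T \frac{\partial f(\hat{x}_{\ell-1})}{\partial x}
 + \frac{1}{2}\,(x - \hat{x}_{\ell-1})^T \frac{\partial^2 f(\hat{x}_{\ell-1})}{\partial x^2}\,(x - \hat{x}_{\ell-1})
\tag{4.52}
$$

where ∂f/∂x is the gradient and ∂²f/∂x² is the Hessian of f(x). See
Appendix B.4. The estimate x̂_ℓ is the minimum of the approximation.
It is found by equating the gradient of the approximation to zero.
Differentiation of (4.52) w.r.t. x gives:

$$
\begin{gathered}
\frac{\partial f(\hat{x}_{\ell-1})}{\partial x}
 + \frac{\partial^2 f(\hat{x}_{\ell-1})}{\partial x^2}\,(\hat{x}_{\ell} - \hat{x}_{\ell-1}) = 0 \\
\Downarrow \\
\hat{x}_{\ell} = \hat{x}_{\ell-1}
 - \left( \frac{\partial^2 f(\hat{x}_{\ell-1})}{\partial x^2} \right)^{-1}
   \frac{\partial f(\hat{x}_{\ell-1})}{\partial x}
\end{gathered}
\tag{4.53}
$$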


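The gradient and Hessian of (4.51), in explicit form, are:

$$
\begin{aligned}
\frac{\partial f(\hat{x}_{\ell-1})}{\partial x}
 &= C_p^{-1}(\hat{x}_{\ell-1} - \bar{x}_p) - H_{\ell}^T C_v^{-1}\bigl(z - h(\hat{x}_{\ell-1})\bigr) \\
\frac{\partial^2 f(\hat{x}_{\ell-1})}{\partial x^2}
 &= C_p^{-1} + H_{\ell}^T C_v^{-1} H_{\ell}
\end{aligned}
\tag{4.54}
$$

where H_ℓ = H(x̂_{ℓ−1}) is the Jacobian matrix of h(x) evaluated at x̂_{ℓ−1}.
Substitution of (4.54) in (4.53) yields the following iteration scheme:

$$
\hat{x}_{\ell} = \hat{x}_{\ell-1}
 - \left( C_p^{-1} + H_{\ell}^T C_v^{-1} H_{\ell} \right)^{-1}
   \left[ C_p^{-1}(\hat{x}_{\ell-1} - \bar{x}_p)
        - H_{\ell}^T C_v^{-1}\bigl(z - h(\hat{x}_{\ell-1})\bigr) \right]
\tag{4.55}
$$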


The result after one iteration, i.e. x̂_1(i), is identical to the ordinary
extended Kalman filter. The required number of further iterations
depends on how fast x̂_ℓ(i) converges. Convergence is not guaranteed,
but if the algorithm converges, usually a small number of iterations
suffices. Therefore, it is common practice to fix the number of iterations
to some practical number L. The final result is set to the last iteration,
i.e. x̄(i|i) = x̂_L.

Equation (3.44) shows that the factor (C_p^{-1} + H_ℓ^T C_v^{-1} H_ℓ)^{-1} is
the error covariance matrix associated with x̄(i|i):

$$
C(i|i) \cong \left( C_p^{-1} + H_{L+1}^T C_v^{-1} H_{L+1} \right)^{-1}
\tag{4.56}
$$

This insight gives another connotation to the last term in (4.55) because,
in fact, the term C(i|i)H_ℓ^T C_v^{-1} can be regarded as the Kalman gain
matrix K_ℓ during the ℓ-th iteration; see (3.20).
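
The iteration scheme (4.55) and the covariance (4.56) can be sketched for a scalar state as follows. The function name `iekf_update`, the exponential measurement function and the numbers are assumptions made for illustration. With `n_iter=1` the result equals the ordinary EKF update, as stated above.

```python
import math


def iekf_update(x_pred, c_pred, z, h, h_jac, c_v, n_iter=5):
    """Scalar IEKF update: n_iter steps of (4.55), then (4.56)."""
    x_est = x_pred                      # iteration starts at the prediction
    for _ in range(n_iter):
        jac = h_jac(x_est)              # H_l, relinearized at the last estimate
        grad = (x_est - x_pred) / c_pred - jac * (z - h(x_est)) / c_v
        hess = 1.0 / c_pred + jac * jac / c_v
        x_est = x_est - grad / hess
    jac = h_jac(x_est)                  # H_{L+1}
    c_upd = 1.0 / (1.0 / c_pred + jac * jac / c_v)
    return x_est, c_upd


def h(x):
    return math.exp(-x)


def h_jac(x):
    return -math.exp(-x)


# Made-up prediction and measurement.
x_ekf, _ = iekf_update(0.0, 1.0, 0.5, h, h_jac, 0.01, n_iter=1)
x_iekf, c_iekf = iekf_update(0.0, 1.0, 0.5, h, h_jac, 0.01, n_iter=20)
print(x_ekf, x_iekf, c_iekf)
```

In this example the single EKF step stops well short of the minimum of (4.51), because the linearization at the prediction is poor; the further iterations close most of that gap within a few steps.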
Example 4.7 The iterated EKF for volume density estimation

In the previous example, the EKF was applied to the density estimation
problem introduced in Example 4.1. The filter was initiated with
the equilibrium state as prior knowledge, i.e. E[x(0)] = x̄ =
[4000 0.1]^T. Figure 4.11(b) shows the transient which occurs if the
EKF is initiated with E[x(0)] = [2000 0]^T. It takes about 40 s before
the estimated density reaches the true density. This slow transient is
due to the fact that in the beginning the linearization is poor. The
iterated EKF is of much help here. Figure 4.11(c) shows the results.
From the first measurement on, the estimated density is close to the
real density. There is no transient.


The extended Kalman filter is widely used because for a long period of 
time no viable alternative solution existed. Nevertheless, it has numerous 
disadvantages: 


• It only works well if the various random vectors are approximately
Gaussian distributed. For complicated densities, the
expectation-covariance representation does not suffice.

• It only works well if the nonlinearities of the system are not too
severe because otherwise the Taylor series approximations fail.
Discontinuities are deadly for the EKF's proper functioning.

• Recalculating the Jacobian matrices at every time step is
computationally expensive.

• In some applications, it is too difficult to find the Jacobian matrix
analytically. In these cases, numerical approximations of the
Jacobian matrix are needed. However, this introduces other types
of problems because now the influence of having approximations
rather than the true values comes in.

• In the EKF, the Kalman gain matrix depends on the data. With that,
the stability of the filter is not assured anymore. Moreover, it is very
hard to analyse the behaviour of the filter.

• The EKF does not guarantee unbiased estimates. In addition, the
calculated error covariance matrices do not necessarily represent
the true error covariances. The analysis of these effects is also hard.

Figure 4.11 Iterated extended Kalman filtering for the volume density estimation
problem. (a) Measurements (b) Results from the EKF (c) Results from the iterated
EKF (no. of iterations = 20)


4.2.3 Other filters for nonlinear systems 


Besides the extended Kalman filter there are many more types of esti- 
mators for nonlinear systems. Particle filtering is a relatively new 
approach for the implementation of the scheme depicted in Figure 4.2. 
The discussion about particle filtering will be deferred to Section 4.4 
because it not only applies to continuous states. Particle filtering is 
generally applicable; it covers the nonlinear, non-Gaussian continuous 
systems, but also discrete systems and mixed systems. 

Statistical linearization is a method comparable with the extended
Kalman filter. But, instead of using a truncated Taylor series
approximation for the nonlinear system functions, a linear approximation
f(x + ε) ≅ f(x) + Fε is used such that the deviation f(x + ε) − f(x) − Fε
is minimized according to a statistical criterion. For instance, one could
try to determine F such that E[‖f(x + ε) − f(x) − Fε‖²] is minimal.

Another method is the unscented Kalman filter. This is a filter midway
between the extended Kalman filter and the particle filter. Assuming
Gaussian densities for x (as in the Kalman filter), the expectation and the
error covariance matrix are represented by means of a number of samples
x^(k) that are used to calculate the effects of a nonlinear system function
on the expectation and the error covariance matrix. Unlike the particle
filter, these samples are not randomly selected. Instead the filter uses a
small number of samples that are carefully selected and that uniquely
represent the covariance matrix. The transformed points, i.e. f(x^(k)), are
used to reconstruct the covariance matrix of f(x). Such a reconstruction
is much more accurate than the approximation that is obtained by
means of the truncated Taylor series expansion.
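
The idea behind the unscented approach can be illustrated with a scalar sketch. The text does not specify how the samples are selected; the code below assumes one commonly used choice, a symmetric set of three points with a scaling parameter `kappa`, and the function name `unscented_transform` is likewise an assumption.

```python
import math


def unscented_transform(mean, var, func, kappa=2.0):
    """Propagate a scalar mean and variance through func using three
    deterministically chosen samples (assumed symmetric point set)."""
    n = 1                                        # state dimension
    spread = math.sqrt((n + kappa) * var)
    points = [mean, mean + spread, mean - spread]
    weights = [kappa / (n + kappa),
               0.5 / (n + kappa),
               0.5 / (n + kappa)]
    ys = [func(p) for p in points]
    y_mean = sum(w * y for w, y in zip(weights, ys))
    y_var = sum(w * (y - y_mean) ** 2 for w, y in zip(weights, ys))
    return y_mean, y_var


# Quadratic nonlinearity applied to x with mean 1 and variance 0.25.
y_mean, y_var = unscented_transform(1.0, 0.25, lambda x: x * x)
print(y_mean, y_var)
```

For this quadratic function the reconstructed mean equals the exact value E[x²] = 1² + 0.25, whereas a first order Taylor approximation around the mean would return f(1) = 1 and miss the contribution of the variance.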
4.3 DISCRETE STATE VARIABLES


We consider physical processes that are described at any time as being in 
one of a finite number of states. Examples of such processes are: 


• The sequence of strokes of a tennis player during a game, e.g.
service, backhand-volley, smash, etc.

• The sequence of actions that a tennis player performs during a
particular stroke.

• The different types of manoeuvres of an airplane, e.g. a linear flight,
a turn, a nose dive, etc.

• The sequence of characters in a word, and the sequence of words in
a sentence.

• The sequence of tones of a melody as part of a musical piece.

• The emotional modes of a person: angry, happy, astonished, etc.


These situations are described by a state variable x(i) that can only take a
value from a finite set of states Ω = {ω_1, ..., ω_K}.

The task is to determine the sequence of states that a particular process
goes through (or has gone through). For that purpose, at any time
measurements z(i) are available. Often, the output of the sensors is
real-valued. Nevertheless, we will assume that the measurements take their
values from a finite set. Thus, some sort of discretization must take place
that maps the range of the sensor data onto a finite set Z = {ϑ_1, ..., ϑ_N}.

This section first introduces a state space model that is often used for 
discrete state variables, i.e. the hidden Markov model. This model will be 
used in the next subsections for online and offline estimation of the states. 


4.3.1 Hidden Markov models 


A hidden Markov model (HMM) is an instance of the state space model 
discussed in Section 4.1.1. It describes a sequence x(i) of discrete states 
starting at time i = 0. The sequence is observed by means of measure- 
ments z(i) that can only take values from a finite set. The model consists 
of the following ingredients: 


• The set Ω containing the K states ω_k that x(i) can take.
• The set Z containing the N symbols ϑ_n that z(i) can take.
• The initial state probability P_0(x(0)).
• The state transition probability P_t(x(i)|x(i−1)).
• The observation probability P_z(z(i)|x(i)).


The expression P_0(x(0)) with x(0) ∈ {1, ..., K} denotes the probability
that the random state variable x(0) takes the value ω_x(0). Thus
P_0(k) ≝ P_0(x(0) = ω_k). Similar conventions hold for other expressions,
like P_t(x(i)|x(i−1)) and P_z(z(i)|x(i)).

The Markov condition of an HMM states that 
P(x(i)|x(0),...,x(i — 1)), i.e. the probability of x(i) under the condition 
of all previous states, equals the transition probability. The assumption 
of the validity of the Markov condition leads to a simple, yet powerful 
model. Another assumption of the HMM is that the measurements are 
memoryless. In other words, z(i) only depends on x(i) and not on the
states at other time points: P(z(j)|x(0),...,x(i)) = P(z(j)|x(j)).

An ergodic Markov model is one for which the observation of a single
sequence x(0), x(1), ..., x(∞) suffices to determine all the state transition
probabilities. A suitable technique for that is histogramming, i.e. the
determination of the relative frequency with which a transition occurs; see
Section 5.2.5. A sufficient condition for ergodicity is that all state
transition probabilities are nonzero. In that case, all states are reachable
from everywhere within one time step. Figure 4.12 is an illustration of an
ergodic model.
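
Histogramming of the transitions can be sketched as follows. The function name and the short example sequence are made up, and states are indexed 0, ..., K−1 here instead of 1, ..., K; in practice a much longer sequence is needed for reliable estimates.

```python
def estimate_transition_probabilities(sequence, n_states):
    """Relative frequency of each transition in one observed sequence.

    Returns probs with probs[prev][cur] as an estimate of P_t(cur | prev).
    """
    counts = [[0] * n_states for _ in range(n_states)]
    for prev, cur in zip(sequence, sequence[1:]):
        counts[prev][cur] += 1
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total if total else 0.0 for c in row])
    return probs


seq = [0, 1, 1, 2, 0, 1, 2, 2, 0, 0, 1]      # made-up state sequence
probs = estimate_transition_probabilities(seq, n_states=3)
for row in probs:
    print(row)
```

A transition that never occurs in the observed sequence gets an estimated probability of zero, which is one reason why the sequence must be long compared with the number of possible transitions.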

Another type is the so-called left-right model. See Figure 4.13. This
model has the property that the state index k of a sequence is
non-decreasing as time proceeds. Such is the case when P_t(k|ℓ) = 0 for all
k < ℓ. In addition, the sequence always starts with ω_1 and terminates
with ω_K. Thus, P_0(k) = δ(k,1) and P_t(k|K) = δ(k,K). Sometimes, an
additional constraint is that large jumps are forbidden. Such a constraint
is enforced by letting P_t(k|ℓ) = 0 for all k > ℓ + Δ. Left-right models
find applications in processes where the sequence of states must obey
some ordering over time. An example is the stroke of a tennis player. For
instance, the service of the player follows a sequence like: ‘take position
behind the base line’, ‘bring racket over the shoulder behind the back’,
‘bring up the ball with the left arm’, etc.

Figure 4.12 A three-state ergodic Markov model

Figure 4.13 A four-state left-right model

In a hidden Markov model the state variable x(i) is observable only
through its measurements z(i). Now, suppose that a sequence Z(i) =
{z(0), z(1), ..., z(i)} of measurements has been observed. Some applications
require the numerical evaluation of the probability P(Z(i)) of a particular
sequence. An example is the recognition of a stroke of a tennis player.
We can model each type of stroke by an HMM that is specific for that type, 
thus having as many HMMs as there are types of strokes. In order to 
recognize the stroke, we calculate for each type of stroke the probability 
P(Z(i)|type of stroke) and select the one with maximum probability. 

For a given HMM, and a fixed sequence Z(i) of acquired measurements, 
P(Z(i)) can be calculated by using the joint probability of having the meas- 
urements Z(i) together with a specific sequence of state variables, i.e. X(i) = 
{x(0),x(1),...,x(i)}. First we calculate the joint probability P(X(i), Z(i)): 


$$
\begin{aligned}
P(X(i), Z(i)) &= P(Z(i)|X(i))\,P(X(i)) \\
 &= \left( \prod_{j=0}^{i} P_z(z(j)|x(j)) \right)
    \left( P_0(x(0)) \prod_{j=1}^{i} P_t(x(j)|x(j-1)) \right)
\end{aligned}
\tag{4.57}
$$

Here, use has been made of the assumption that each measurement z(i)
only depends on x(i). P(Z(i)) follows from summation over all possible
state sequences:

$$
P(Z(i)) = \sum_{\text{all } X(i)} P(X(i), Z(i))
\tag{4.58}
$$

Since there are K^(i+1) different state sequences, the direct implementation
of (4.57) and (4.58) requires on the order of (i+1)K^(i+1) operations. Even
for modest values of i, the number of operations is already impractical.

A more economical approach is to calculate P(Z(i)) by means of a
recursion. For that, consider the probability P(Z(i), x(i)). This probability
can be calculated from the previous time step i - 1 using the following
expression:

$$
\begin{aligned}
P(Z(i), x(i)) &= \sum_{x(i-1)=1}^{K} P(Z(i), x(i), x(i-1)) \\
&= \sum_{x(i-1)=1}^{K} P(z(i), x(i)|Z(i-1), x(i-1))\,P(Z(i-1), x(i-1)) \\
&= \sum_{x(i-1)=1}^{K} P(z(i), x(i)|x(i-1))\,P(Z(i-1), x(i-1)) \\
&= P_z(z(i)|x(i)) \sum_{x(i-1)=1}^{K} P_t(x(i)|x(i-1))\,P(Z(i-1), x(i-1))
\end{aligned}
\tag{4.59}
$$


The recursion must be initiated with P(z(0), x(0)) = P_0(x(0))P_z(z(0)|x(0)).
The probability P(Z(i)) can be retrieved from P(Z(i), x(i)) by:

$$P(Z(i)) = \sum_{x(i)=1}^{K} P(Z(i), x(i)) \tag{4.60}$$


The so-called forward algorithm uses the array F(i, x(i)) = P(Z(i), x(i))
to implement the recursion. (The adjective ‘forward’ refers to the fact that the
algorithm proceeds forwards in time. Section 4.3.3 introduces the backward
algorithm.)


Algorithm 4.1: The forward algorithm

1. Initialization:
   $F(0, x(0)) = P_0(x(0))\,P_z(z(0)|x(0))$ for x(0) = 1, ..., K

2. Recursion:
   for i = 1, 2, 3, 4, ...
   • for x(i) = 1, ..., K
     $F(i, x(i)) = P_z(z(i)|x(i)) \sum_{x(i-1)=1}^{K} F(i-1, x(i-1))\,P_t(x(i)|x(i-1))$


In each recursion step, the sum consists of K terms, and the number of
possible values of x(i) is also K. Therefore, such a step requires on the
order of K^2 calculations. The computational complexity for i time steps
is on the order of (i + 1)K^2.
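Algorithm 4.1 can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names, the 0-based indexing of states and symbols, and the toy two-state model are assumptions made here and do not come from the text.

```python
def forward(p0, pt, pz, z):
    """Return F with F[i][k] = P(Z(i), x(i)=k) for a discrete HMM.

    p0[k]    : prior probability P_0(x(0)=k)
    pt[l][k] : transition probability P_t(x(i)=k | x(i-1)=l)
    pz[k][n] : observation probability P_z(z(i)=n | x(i)=k)
    z        : observed symbols z(0), ..., z(i)
    States and symbols are 0-based here (the text counts from 1).
    """
    K = len(p0)
    # Initialization: F(0, k) = P_0(k) P_z(z(0)|k)
    F = [[p0[k] * pz[k][z[0]] for k in range(K)]]
    # Recursion: F(i, k) = P_z(z(i)|k) * sum_l F(i-1, l) P_t(k|l)
    for i in range(1, len(z)):
        prev = F[-1]
        F.append([pz[k][z[i]] * sum(prev[l] * pt[l][k] for l in range(K))
                  for k in range(K)])
    return F


def sequence_probability(p0, pt, pz, z):
    """P(Z(i)), obtained by summing the last row of F as in (4.60)."""
    return sum(forward(p0, pt, pz, z)[-1])


# Toy two-state, two-symbol model (illustrative numbers only).
p0 = [0.6, 0.4]
pt = [[0.7, 0.3], [0.4, 0.6]]
pz = [[0.9, 0.1], [0.2, 0.8]]
print(sequence_probability(p0, pt, pz, [0, 1, 1]))
```

For the stroke recognition task one would evaluate `sequence_probability` once per stroke-specific model and pick the model with the largest value. For long sequences the products underflow, so a practical version would rescale each row of F or work with logarithms.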


4.3.2 Online state estimation 


We now focus our attention on the situation of having a single HMM, 
where the sequence of measurements is processed online so as to obtain 
real-time estimates x̂(i|i) of the states. This problem completely fits within
the framework of Section 4.1. As such, the solution provided by (4.8) and 
(4.9) is valid, albeit that the integrals must be replaced by summations. 

However, in line with the previous section, an alternative solution will 
be presented that is equivalent to the one of Section 4.1. The alternative 
solution is obtained by deduction of the posterior probability: 


$$P(x(i)|Z(i)) = \frac{P(Z(i), x(i))}{P(Z(i))} \tag{4.61}$$


In view of the fact that Z(i) are the acquired measurements (and as such 
known and fixed) the maximization of P(x(i)|Z(i)) is equivalent to the 
maximization of P(Z(i), x(i)). Therefore, the MAP estimate is found as: 


$$\hat{x}_{MAP}(i|i) = \arg\max_{x(i)} \{P(Z(i), x(i))\} \tag{4.62}$$

The probability P(Z(i), x(i)) follows from the forward algorithm.
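The online MAP estimate of (4.62) can be sketched as a generator that consumes one measurement at a time. The rescaling of the forward probabilities at every step is an addition made here to avoid numerical underflow; it does not change the arg max because the scale factor does not depend on x(i). Names, 0-based indices and the toy model are assumptions for illustration.

```python
def online_map(p0, pt, pz, measurements):
    """Yield (x_map, posterior) after each new measurement.

    p0[k], pt[l][k], pz[k][n] are the prior, transition and observation
    probabilities with 0-based states and symbols. The sketch assumes that
    every measurement has non-zero probability under the model.
    """
    K = len(p0)
    f = None
    for z in measurements:
        if f is None:
            # Initialization of the forward recursion
            f = [p0[k] * pz[k][z] for k in range(K)]
        else:
            # One forward step using the rescaled values of the previous step
            f = [pz[k][z] * sum(f[l] * pt[l][k] for l in range(K))
                 for k in range(K)]
        total = sum(f)
        f = [v / total for v in f]        # now f[k] = P(x(i)=k | Z(i)), cf. (4.61)
        yield f.index(max(f)), f          # MAP estimate, cf. (4.62)


# Toy two-state model (illustrative numbers only).
p0 = [0.6, 0.4]
pt = [[0.7, 0.3], [0.4, 0.6]]
pz = [[0.9, 0.1], [0.2, 0.8]]
estimates = [x for x, _ in online_map(p0, pt, pz, [0, 1, 1])]
print(estimates)
```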


Example 4.8 Online license plate detection in videos 

This example demonstrates the ability of HMMs to find the license plate 
of a vehicle in a video. Figure 4.14 is a typical example of one frame of 
such a video. The task is to find all the pixels that correspond to the license 
plate. Such a task is the first step in a license plate recognition system. 

A major characteristic of video is that a frame is scanned line- 
by-line, and that each video line is acquired from left to right. The 
real-time processing of each line individually is preferable because 
the throughput requirement of the application is demanding. Therefore, 
each line is individually modelled as an HMM. The hidden state of a 
pixel is determined by whether the pixel corresponds to a license plate 
or not. 

The measurements are embedded in the video line. See Figure 4.15. 
However, the video signal needs to be processed in order to map it 
onto a finite measurement space. Simply quantizing the signal to a 
finite number of levels does not suffice because the amplitudes of the 
signal alone are not very informative. The main characteristic of a 
license plate in a video line is a typical pattern of dark-bright and 
bright-dark transitions due to the dark characters against a bright 
background, or vice versa. The image acquisition is such that the 
camera—object distance is about constant for all vehicles. Therefore, 
the statistical properties of the succession of transitions are typical for 
the imaged license plate regardless of the type of vehicle. 





Figure 4.14 License plate detection

Figure 4.15 Definitions of the measurements associated with a video line


One possibility to decode the succession of transitions is to apply 
a filter bank and to threshold the output of each filter, thus yielding a 
set of binary signals. In Figure 4.15, three high-pass filters have been 
applied with three different cut-off frequencies. Using high-pass filters 
has the advantage that the thresholds can be zero. As such, the results 
do not depend on the contrast and brightness of the image. The three 
binary signals define a measurement signal z(i) consisting of N = 8 
symbols. Figure 4.16 shows these symbols for one video line. Here, 
the symbols are encoded as integers from 1 up to 8. 

Due to the spatial context of the three binary signals we cannot
model the measurements as memoryless symbols. The trick to avoid
this problem is to embed the measurement z(i) in the state variable
x(i). This can be done by encoding the state variable as integers from 1
up to 16. If i is not a license plate pixel, we define the state as
x(i) = z(i). If i is a license plate pixel, we define x(i) = z(i) + 8. With
that, K = 16. Figure 4.16 shows these states for one video line.

Figure 4.16 States and measurements of a video line (panels: true license plate pixels; measurements)

The embedding of the measurements in the state variables is a form 
of state augmentation. Originally, the number of states was 2, but after 
this particular state augmentation, the number becomes 16. The advan- 
tage of the augmentation is that the dependence, which does exist 
between any pair z(i), z(j) of measurements, is now properly modelled
by means of the transition probability of the states. Yet, the model still 
meets all the requirements of an HMM. However, due to our definition 
of the state, the relation between state and measurement becomes 
deterministic. The observation probability degenerates into: 


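$$
P_z(n|k) = \begin{cases}
1 & \text{if } n = k \text{ and } k \le 8 \\
1 & \text{if } n = k - 8 \text{ and } k > 8 \\
0 & \text{elsewhere}
\end{cases}
$$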
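The state augmentation and the resulting deterministic observation probability can be made concrete with a small sketch. The helper names are hypothetical; the encoding follows the definition given in the example, with symbols 1 to 8 and augmented states 1 to 16.

```python
N = 8          # number of measurement symbols
K = 2 * N      # number of augmented states


def encode_state(z, is_plate):
    """x = z for background pixels, x = z + 8 for license plate pixels."""
    return z + N if is_plate else z


def decode_state(x):
    """Return (z, is_plate) for an augmented state x in 1..16."""
    return (x - N, True) if x > N else (x, False)


def observation_probability(n, k):
    """P_z(n|k): 1 if symbol n is the one embedded in state k, else 0."""
    return 1.0 if decode_state(k)[0] == n else 0.0


print(encode_state(3, True), decode_state(11), observation_probability(3, 11))
```

Because each state emits exactly one symbol with probability one, all uncertainty of the model sits in the prior and in the transition probabilities.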


In order to define the HMM, the probabilities P_0(k) and P_t(k|ℓ) must
be specified. We used a supervised learning procedure to estimate
P_0(k) and P_t(k|ℓ). For that purpose, 30 images of 30 different vehicles,
similar to the one in Figure 4.14, were used. For each image, the 
license plate area was manually indexed. Histogramming was used to 
estimate the probabilities. 
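The text does not spell out the histogramming step. One plausible reading, sketched below, is to count initial states and state transitions in the labelled lines and to normalize the counts. The function name, the 0-based state indices and the uniform fallback for states that never occur are assumptions made for this illustration.

```python
def estimate_hmm(sequences, K):
    """Estimate P_0 and P_t by counting in labelled state sequences.

    sequences : list of state sequences, each a list of 0-based state indices
    K         : number of states
    Returns (p0, pt) with pt[l][k] the estimate of P_t(k|l).
    """
    p0_counts = [0] * K
    pt_counts = [[0] * K for _ in range(K)]
    for seq in sequences:
        p0_counts[seq[0]] += 1
        for prev, cur in zip(seq, seq[1:]):
            pt_counts[prev][cur] += 1
    p0 = [c / len(sequences) for c in p0_counts]
    pt = []
    for row in pt_counts:
        total = sum(row)
        # A state that never occurs as a predecessor gets a uniform row
        # (an arbitrary choice made for this sketch).
        pt.append([c / total for c in row] if total else [1.0 / K] * K)
    return p0, pt


p0, pt = estimate_hmm([[0, 0, 0, 1], [0, 1, 1, 1]], K=2)
print(p0, pt)
```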

Application of the online estimation to the video line shown in 
Figures 4.15 and 4.16 yields results like those shown in Figure 4.17. 
The figure shows the posterior probability for having a license plate. 
According to our definition of the state, the posterior probability of 
having a license plate pixel is P(x(i) > 8|Z(i)). Since by definition 
online estimation is causal, the rise and decay of this probability 
shows a delay. Consequently, the estimated position of the license 
plate is biased towards the right. Figure 4.18 shows the detected 
license plate pixels. 


4.3.3 Offline state estimation 


In non-real-time applications the sequence of measurements can be
buffered before the state estimation takes place. The advantage is that
not only ‘past and present’ measurements can be used, but also ‘future’
measurements. Exactly these measurements can prevent the delay that
inherently occurs in online estimation.

Figure 4.17 Online state estimation

Figure 4.18 Detected license plate pixels using online estimation

The problem is formulated as follows. Given a sequence Z(I) =
{z(0),...,z(I)} of I + 1 measurements of a given HMM, determine the
optimal estimate of the sequence x(0),...,x(I) of the underlying states.

Up to now, the adjective ‘optimal’ meant that we determined the 
individual posterior probability P(x(i)|measurements) for each time 
point individually, and that some cost function was applied to determine 
the estimate with the minimal risk. For instance, the adoption of a 
uniform cost function for each state leads to an estimate that maximizes 
the individual posterior probability. Such an estimate minimizes the 
probability of having an erroneous decision for such a state.

Minimizing the error probabilities of all individually estimated states
does not imply that the sequence of states is estimated with minimal
error probability. It might even occur that a sequence of ‘individually
best’ estimated states contains a forbidden transition, i.e. a transition for
which P_t(x(i)|x(i - 1)) = 0. In order to circumvent this, we need a criterion
that involves all states jointly.

This section discusses two offline estimation methods. One is an 
‘individually best’ solution. The other is an ‘overall best’ solution. 


Individually most likely states 


Here, the strategy is to determine the posterior probability P(x(i)|Z(I)), and
then to determine the MAP estimate: x̂(i|I) = arg max P(x(i)|Z(I)). As said
before, this method minimizes the error probabilities of the individual states.
As such, it maximizes the expected number of correctly estimated states.

Section 4.3.1 discussed the forward algorithm, a recursive algorithm
for the calculation of the probability P(x(i), Z(i)). We now introduce the
backward algorithm which calculates the probability P(z(i+1),...,z(I)|x(i)).
During each recursion step of the algorithm, the probability
P(z(j),...,z(I)|x(j - 1)) is derived from P(z(j+1),...,z(I)|x(j)). The
recursion proceeds as follows:


$$
P(z(j),\ldots,z(I)|x(j-1)) = \sum_{x(j)=1}^{K} P_t(x(j)|x(j-1))\,P_z(z(j)|x(j))\,P(z(j+1),\ldots,z(I)|x(j))
\tag{4.63}
$$


The algorithm starts with j = I, and proceeds backwards in time, i.e.
I, I - 1, I - 2, ... until finally j = i + 1. In the first step, the expression
P(z(I + 1)|x(I)) appears. Since that probability does not exist (because
z(I + 1) is not available), it should be replaced by 1 to have the proper
initialization.

The availability of the forward and backward probabilities suffices for 
the calculation of the posterior probability: 


$$
\begin{aligned}
P(x(i)|Z(I)) &= \frac{P(x(i), Z(I))}{P(Z(I))} \\
&= \frac{P(z(i+1),\ldots,z(I)|x(i), Z(i))\,P(x(i), Z(i))}{P(Z(I))} \\
&= \frac{P(z(i+1),\ldots,z(I)|x(i))\,P(x(i), Z(i))}{P(Z(I))}
\end{aligned}
\tag{4.64}
$$

As said before, the individually most likely state is the one which maximizes
P(x(i)|Z(I)). The denominator of (4.64) is not relevant for this
maximization since it does not depend on x(i).

The complete forward-backward algorithm is as follows:


Algorithm 4.2: The forward-backward algorithm

1. Perform the forward algorithm as given in Section 4.3.1, resulting in
   the array F(i,k) with i = 0,...,I and k = 1,...,K

2. Backward algorithm:
   • Initialization:
     B(I,k) = 1 for k = 1,...,K
   • Recursion:
     for i = I - 1, I - 2, ..., 0 and x(i) = 1,...,K
     $B(i, x(i)) = \sum_{x(i+1)=1}^{K} P_t(x(i+1)|x(i))\,P_z(z(i+1)|x(i+1))\,B(i+1, x(i+1))$

3. MAP estimation of the states:
   $\hat{x}_{MAP}(i|I) = \arg\max_{k=1,\ldots,K} \{B(i,k)\,F(i,k)\}$


The forward-backward algorithm has a computational complexity that
is on the order of (I + 1)K^2. The algorithm is therefore feasible.
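A compact sketch of Algorithm 4.2 follows. It stores the forward array F and the backward array B and combines them as in (4.64). Function and variable names, 0-based indexing and the toy model are assumptions for illustration; no rescaling is applied, so the sketch is only suitable for short sequences.

```python
def forward_backward(p0, pt, pz, z):
    """Return (x_map, posterior) with posterior[i][k] = P(x(i)=k | Z(I)).

    p0[k], pt[l][k], pz[k][n]: prior, transition and observation
    probabilities with 0-based states and symbols; z holds z(0), ..., z(I).
    """
    K, I = len(p0), len(z) - 1
    # Forward pass: F[i][k] = P(Z(i), x(i)=k)
    F = [[p0[k] * pz[k][z[0]] for k in range(K)]]
    for i in range(1, I + 1):
        F.append([pz[k][z[i]] * sum(F[i - 1][l] * pt[l][k] for l in range(K))
                  for k in range(K)])
    # Backward pass: B[i][k] = P(z(i+1), ..., z(I) | x(i)=k), with B[I][k] = 1
    B = [[1.0] * K for _ in range(I + 1)]
    for i in range(I - 1, -1, -1):
        B[i] = [sum(pt[k][m] * pz[m][z[i + 1]] * B[i + 1][m] for m in range(K))
                for k in range(K)]
    p_Z = sum(F[I])                       # P(Z(I)), the denominator of (4.64)
    posterior = [[F[i][k] * B[i][k] / p_Z for k in range(K)]
                 for i in range(I + 1)]
    x_map = [row.index(max(row)) for row in posterior]
    return x_map, posterior


# Toy two-state model (illustrative numbers only).
p0 = [0.6, 0.4]
pt = [[0.7, 0.3], [0.4, 0.6]]
pz = [[0.9, 0.1], [0.2, 0.8]]
x_map, posterior = forward_backward(p0, pt, pz, [0, 1, 1])
print(x_map)
```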


The most likely state sequence 


A criterion that involves the whole sequence of states is the overall uni- 
form cost function. The function is zero when the whole sequence is 
estimated without any error. It is unit if one or more states are estimated 
erroneously. Application of this cost function within a Bayesian frame- 
work leads to a solution that maximizes the overall posterior probability: 


$$\hat{x}(0),\ldots,\hat{x}(I) = \arg\max_{x(0),\ldots,x(I)} \{P(x(0),\ldots,x(I)|Z(I))\} \tag{4.65}$$

The computation of this most likely state sequence is done efficiently by
means of a recursion that proceeds forwards in time. The goal of this 
recursion is to keep track of the following subsequences: 


$$\hat{x}(0),\ldots,\hat{x}(i-1) = \arg\max_{x(0),\ldots,x(i-1)} \{P(x(0),\ldots,x(i-1),x(i)|Z(i))\} \tag{4.66}$$


For each value of x(i), this formulation defines a particular partial 
sequence. Such a sequence is the most likely partial sequence from time 
zero and ending at a particular value x(i) at time i given the measurements
z(0),...,z(i). Since x(i) can have K different values, there are
K partial sequences for each value of i. Instead of using (4.66), we can 
equivalently use 


$$\hat{x}(0),\ldots,\hat{x}(i-1) = \arg\max_{x(0),\ldots,x(i-1)} \{P(x(0),\ldots,x(i-1),x(i),Z(i))\} \tag{4.67}$$

because P(X(i)|Z(i)) = P(X(i), Z(i))/P(Z(i)) and Z(i) is fixed.

In each recursion step the maximal probability of the path ending in 
x(i) given Z(i) is transformed into the maximal probability of the path 
ending in x(i+ 1) given Z(i + 1). For that purpose, we use the following 
equality: 


$$
\begin{aligned}
P(x(0),\ldots,x(i),x(i+1),Z(i+1)) &= P(x(i+1), z(i+1)|x(0),\ldots,x(i),Z(i))\,P(x(0),\ldots,x(i),Z(i)) \\
&= P_z(z(i+1)|x(i+1))\,P_t(x(i+1)|x(i))\,P(x(0),\ldots,x(i),Z(i))
\end{aligned}
\tag{4.68}
$$





Here, the Markov condition has been used together with the assumption 
that the measurements are memoryless. 
The maximization of the probability proceeds as follows: 
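$$
\begin{aligned}
&\max_{x(0),\ldots,x(i)} \{P(x(0),\ldots,x(i),x(i+1),Z(i+1))\} \\
&\quad= \max_{x(0),\ldots,x(i)} \{P_z(z(i+1)|x(i+1))\,P_t(x(i+1)|x(i))\,P(x(0),\ldots,x(i),Z(i))\} \\
&\quad= P_z(z(i+1)|x(i+1)) \max_{x(i)} \Big[ P_t(x(i+1)|x(i)) \max_{x(0),\ldots,x(i-1)} \{P(x(0),\ldots,x(i-1),x(i),Z(i))\} \Big]
\end{aligned}
\tag{4.69}
$$

The value of x(i) that maximizes P(x(0),...,x(i),x(i+1),Z(i+1)) is a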
function of x(i+ 1): 


$$
\hat{x}(i|x(i+1)) = \arg\max_{x(i)} \Big\{ P_t(x(i+1)|x(i)) \max_{x(0),\ldots,x(i-1)} \{P(x(0),\ldots,x(i-1),x(i),Z(i))\} \Big\}
\tag{4.70}
$$


The so-called Viterbi algorithm uses the recursive equation in (4.69) and
the corresponding optimal state dependency expressed in (4.70) to find
the optimal path. For that, we define the array Q(i, x(i)) =
max_{x(0),...,x(i-1)} {P(x(0),...,x(i - 1), x(i), Z(i))}.


Algorithm 4.3: The Viterbi algorithm

1. Initialization:
   for x(0) = 1,...,K
   • $Q(0, x(0)) = P_0(x(0))\,P_z(z(0)|x(0))$

2. Recursion:
   for i = 1,...,I and x(i) = 1,...,K
   • $Q(i, x(i)) = \max_{x(i-1)} \{Q(i-1, x(i-1))\,P_t(x(i)|x(i-1))\}\,P_z(z(i)|x(i))$
   • $R(i, x(i)) = \arg\max_{x(i-1)} \{Q(i-1, x(i-1))\,P_t(x(i)|x(i-1))\}$

3. Termination:
   • $P = \max_{x(I)} \{Q(I, x(I))\}$
   • $\hat{x}(I|I) = \arg\max_{x(I)} \{Q(I, x(I))\}$

4. Backtracking:
   for i = I - 1, I - 2, ..., 0
   • $\hat{x}(i|I) = R(i+1, \hat{x}(i+1|I))$

The computational structure of the Viterbi algorithm is comparable to
that of the forward algorithm. The computational complexity is also on
the order of (i + 1)K^2.
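The following sketch mirrors the four steps of the Viterbi algorithm. It is an illustration under assumptions made here: 0-based states and symbols, a toy two-state model, and plain probabilities instead of logarithms, so it is meant for short sequences only.

```python
def viterbi(p0, pt, pz, z):
    """Return (path, p_best): the most likely state sequence and its joint
    probability with the measurements z(0), ..., z(I).

    p0[k], pt[l][k], pz[k][n]: prior, transition and observation
    probabilities with 0-based states and symbols.
    """
    K = len(p0)
    # Initialization: Q(0, k) = P_0(k) P_z(z(0)|k)
    Q = [p0[k] * pz[k][z[0]] for k in range(K)]
    R = []                                   # back pointers for i = 1, ..., I
    # Recursion
    for i in range(1, len(z)):
        back, new_q = [], []
        for k in range(K):
            best = max(range(K), key=lambda l: Q[l] * pt[l][k])
            back.append(best)
            new_q.append(Q[best] * pt[best][k] * pz[k][z[i]])
        R.append(back)
        Q = new_q
    # Termination
    p_best = max(Q)
    path = [Q.index(p_best)]
    # Backtracking along the stored pointers
    for back in reversed(R):
        path.append(back[path[-1]])
    path.reverse()
    return path, p_best


# Toy two-state model (illustrative numbers only).
p0 = [0.6, 0.4]
pt = [[0.7, 0.3], [0.4, 0.6]]
pz = [[0.9, 0.1], [0.2, 0.8]]
path, p_best = viterbi(p0, pt, pz, [0, 1, 1])
print(path)
```

A production version would typically add log-probabilities, since the products in Q shrink quickly as the sequence grows.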


Example 4.9 Offline license plate detection in videos 

Figure 4.19 shows the results of the two offline state estimators 
applied to the video line shown in Figure 4.15. Figure 4.20 provides 
the results of the whole image shown in Figure 4.14. 

Both methods are able to prevent the delay that is inherent in online 
estimation. Nevertheless, both methods show some falsely detected 
license plate pixels on the right side of the plate. These errors are 
caused by a sticker containing some text. Apparently, the statistical
properties of the image of this sticker are similar to those of a license
plate.

A comparison between the individually estimated states and the 
jointly estimated states shows that the latter are more coherent, and 
that the former are more fragmented. Clearly, such a fragmentation 
increases the probability of having erroneous transitions of estimated 
states. However, usually the resulting erroneous regions are small. The
jointly estimated states do not show many of these unwanted transi- 
tions, but if they occur, then they are more serious because they result 
in a larger erroneous region. 


Figure 4.19 Offline state estimation (panels: true license plate pixels; offline, individually estimated states and detected pixels; offline, jointly estimated states and detected pixels)

Figure 4.20 Detected license plate pixels using offline estimation. (a) Individually
estimated states. (b) Jointly estimated states


MATLAB functions for HMM 


The MATLAB functions for the analysis of hidden Markov models are 
found in the Statistics toolbox. There are five functions: 


hmmgenerate: Given P_t(·|·) and P_z(·|·), generate a sequence of states
             and observations.
hmmdecode:   Given P_t(·|·), P_z(·|·) and a sequence of observations,
             calculate the posterior probabilities of the states.
hmmestimate: Given a sequence of states and observations, estimate
             P_t(·|·) and P_z(·|·).
hmmtrain:    Given a sequence of observations, estimate P_t(·|·)
             and P_z(·|·).
hmmviterbi:  Given P_t(·|·), P_z(·|·) and a sequence of observations,
             calculate the most likely state sequence.


The function hmmtrain() implements the so-called Baum-Welch
algorithm.


4.4 MIXED STATES AND THE PARTICLE FILTER 


Sections 4.2 and 4.3 focused on special cases of the general online estimation 
problem. The topic of Section 4.2 was continuous state estimation, and in 
particular the linear-Gaussian case or approximations of that. Section 4.3 
discussed discrete state estimation. We now return to the general scheme of 
Figure 4.2. The current section introduces the family of particle filters (PF). It 
is a group of estimators that try to implement the general case. These 
estimators use random samples to represent the probability densities, just 
as in the case of Parzen estimation; see Section 5.3.1. As such, particle filters 
are able to handle nonlinear, non-Gaussian systems, continuous states, dis- 
crete states and even combinations. In the sequel we use probability densities 
(which in the discrete case must be replaced by probability functions). 


4.4.1 Importance sampling 


A Monte Carlo simulation uses a set of random samples generated from 
a known distribution to estimate the expectation of any function of that 
distribution. More specifically, let x^(k), k = 1,...,K be samples drawn
from a conditional probability density p(x|z). Then, the expectation of
any function g(x) can be estimated by:

$$E[g(x)|z] \cong \frac{1}{K} \sum_{k=1}^{K} g(x^{(k)}) \tag{4.71}$$

Under mild conditions, the right-hand side asymptotically approximates
the expectation as K increases. For instance, the conditional expectation
and covariance matrix are found by substitution of g(x) = x and
g(x) = (x - x̂)(x - x̂)^T, respectively.
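As a small numerical illustration of (4.71), the sketch below estimates the mean and the variance of a scalar density from samples. The Gaussian stand-in for p(x|z), the function names and the sample size are assumptions chosen for this example.

```python
import random


def monte_carlo_expectation(g, draw, K, rng):
    """Estimate E[g(x)] as the average of g over K samples, cf. (4.71)."""
    return sum(g(draw(rng)) for _ in range(K)) / K


def draw_sample(rng):
    """Stand-in for sampling from p(x|z): a Gaussian with mean 2 and std 1."""
    return rng.gauss(2.0, 1.0)


rng = random.Random(0)
mean = monte_carlo_expectation(lambda x: x, draw_sample, 20000, rng)
variance = monte_carlo_expectation(lambda x: (x - mean) ** 2, draw_sample, 20000, rng)
print(mean, variance)
```

With 20000 samples both estimates should lie close to the true values 2 and 1; the error shrinks roughly with the square root of K.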

In the particle filter, the set x^(k) depends on the time index i. It
represents the posterior density p(x(i)|Z(i)). The samples are called the
particles. The density can be estimated from the particles by some
kernel-based method, for instance, the Parzen estimator to be discussed
in Section 5.3.1.

A problem in the particle filter is that we do not know the posterior 
density beforehand. The solution for that is to get the samples from some 
other density, say q(x), called the proposal density. The various members
of the PF family differ (among other things) in their choice of this 
density. The expectation of g(x) w.r.t. p(x|z) becomes: 


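$$
\begin{aligned}
E[g(x)|z] &= \int g(x)\,p(x|z)\,dx = \int g(x)\,\frac{p(x|z)}{q(x)}\,q(x)\,dx \\
&= \int g(x)\,\frac{p(z|x)\,p(x)}{p(z)\,q(x)}\,q(x)\,dx \\
&= \frac{1}{p(z)} \int g(x)\,w(x)\,q(x)\,dx \quad\text{with}\quad w(x) = \frac{p(z|x)\,p(x)}{q(x)}
\end{aligned}
\tag{4.72}
$$

The factor 1/p(z) is a normalizing constant. It can be eliminated as
follows:

$$
\begin{aligned}
p(z) &= \int p(z|x)\,p(x)\,dx = \int p(z|x)\,p(x)\,\frac{q(x)}{q(x)}\,dx \\
&= \int w(x)\,q(x)\,dx
\end{aligned}
\tag{4.73}
$$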


Using (4.72) and (4.73) we can estimate E[g(x)|z] by means of a set of 
samples drawn from q(x): 





$$E[g(x)|z] \cong \frac{\sum_{k=1}^{K} w(x^{(k)})\,g(x^{(k)})}{\sum_{k=1}^{K} w(x^{(k)})} \tag{4.74}$$


Being the ratio of two estimates, the estimate (4.74) of E[g(x)|z] is biased. However,
under mild conditions, it is asymptotically unbiased and consistent
as K increases. One of the requirements is that q(x) overlaps the support
of p(x).


Usually, the shorter notation for the unnormalized importance
weights w^(k) = w(x^(k)) is used. The so-called normalized importance
weights are w_norm^(k) = w^(k) / Σ_k w^(k). With that, expression (4.74)
simplifies to:

$$E[g(x)|z] \cong \sum_{k=1}^{K} w_{norm}^{(k)}\,g(x^{(k)}) \tag{4.75}$$
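Importance sampling with normalized weights can be tried out on a scalar problem with a known answer. The sketch assumes a prior p(x) = N(0, 1), a likelihood p(z|x) = N(z; x, 1) and a proposal q(x) = N(0, 2^2); these choices and all names are illustrative. For z = 2 the exact posterior mean is 1.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a scalar Gaussian."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def importance_estimate(g, z, K, rng):
    """Estimate E[g(x)|z] from K samples of the proposal density q(x)."""
    xs = [rng.gauss(0.0, 2.0) for _ in range(K)]            # samples from q(x)
    # Unnormalized weights w(x) = p(z|x) p(x) / q(x)
    w = [normal_pdf(z, x, 1.0) * normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)
         for x in xs]
    total = sum(w)
    # Weighted sum with normalized weights
    return sum(wk / total * g(x) for wk, x in zip(w, xs))


rng = random.Random(1)
posterior_mean = importance_estimate(lambda x: x, 2.0, 20000, rng)
print(posterior_mean)
```

Note that p(z) never has to be evaluated: it cancels in the ratio of weighted sums.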


4.4.2 Resampling by selection

Importance sampling provides us with samples x^(k) and weights w_norm^(k).
Taken together, they represent the density p(x|z). However, we can transform
this representation to a new set of samples with equal weights. The
procedure to do that is selection. The purpose is to delete samples with low
weights, and to retain multiple copies of samples with high weights. The
number of samples does not change by this; K is kept constant. The various
members from the PF family may differ in the way they select the samples.
However, an often used method is to draw the samples with replacement
according to a multinomial distribution with probabilities w_norm^(k).

Such a procedure is easily accomplished by calculation of the cumulative
weights:

$$w_{cum}^{(k)} = \sum_{j=1}^{k} w_{norm}^{(j)} \tag{4.76}$$

We generate K random numbers r^(k) with k = 1,...,K. These numbers
must be uniformly distributed between 0 and 1. Then, the k-th sample
x_selected^(k) in the new set is a copy of the j-th sample x^(j) where j is the
smallest integer for which w_cum^(j) ≥ r^(k).
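The selection procedure maps directly onto a cumulative sum followed by a binary search. The sketch below is one possible realization; the function name is hypothetical, and drawing r from the half-open interval (0, 1] is a choice made here so that samples with zero weight at the start of the list are never picked.

```python
import bisect
import itertools
import random


def resample_by_selection(samples, weights, rng):
    """Return K equally weighted samples drawn with replacement.

    samples : list of K samples x^(k)
    weights : list of K non-negative (possibly unnormalized) weights
    """
    total = sum(weights)
    # Cumulative normalized weights, cf. (4.76)
    w_cum = list(itertools.accumulate(w / total for w in weights))
    w_cum[-1] = 1.0                          # guard against round-off
    selected = []
    for _ in samples:
        r = 1.0 - rng.random()               # uniform in (0, 1]
        j = bisect.bisect_left(w_cum, r)     # smallest j with w_cum[j] >= r
        selected.append(samples[j])
    return selected


rng = random.Random(1)
print(resample_by_selection(['a', 'b', 'c'], [0.1, 0.8, 0.1], rng))
```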

Figure 4.21 is an illustration. The figure shows a density p(x) and a
proposal density q(x). Samples x^(k) from q(x) can represent p(x) if they
are provided with weights w^(k) ∝ p(x^(k))/q(x^(k)). These weights are
visualized in Figure 4.21(d) by the radii of the circles. Resampling by
selection gives an unweighted representation of p(x). In Figure 4.21(e),
multiple copies of one sample are depicted as a pile. The height of the
pile stands for the multiplicity of the copy.

Figure 4.21 Representation of a probability density. (a) A density p(x). (b) The
proposal density q(x). (c) 40 samples of q(x). (d) Importance sampling of p(x) using
the 40 samples from q(x). (e) Selected samples from (d) as an equally weighted
sample representation of p(x)


4.4.3 The condensation algorithm 


One of the simplest applications of importance sampling combined
with resampling by selection is in the so-called condensation algorithm
(‘conditional density propagation’). The algorithm follows the general
scheme of Figure 4.2. The prediction density p(x(i)|Z(i - 1)) is used as
the proposal density q(x). So, at time i, we assume that a set x^(k) is
available which is an unweighted representation of p(x(i)|Z(i - 1)). We
use importance sampling to find the posterior density p(x(i)|Z(i)). For
that purpose we make the following substitutions in (4.72):

p(x)   → p(x(i)|Z(i - 1))
p(x|z) → p(x(i)|z(i), Z(i - 1)) = p(x(i)|Z(i))
q(x)   → p(x(i)|Z(i - 1))
p(z|x) → p(z(i)|x(i), Z(i - 1)) = p(z(i)|x(i))

The weights w_norm^(k) that define the representation of p(x(i)|Z(i)) are
obtained from:

$$w^{(k)} = p(z(i)|x^{(k)}) \tag{4.77}$$

Next, resampling by selection provides an unweighted representation
x_selected^(k). The last step is the prediction. Using x_selected^(k) as a
representation for p(x(i)|Z(i)), the representation of p(x(i+1)|Z(i)) is found by
generating one new sample x^(k) for each sample x_selected^(k) using
p(x(i+1)|x(i) = x_selected^(k)) as the density to draw from. The algorithm is as
follows:


Algorithm 4.4: The condensation algorithm

1. Initialization:
   • Set i = 0
   • Draw K samples x^(k), k = 1,...,K, from the prior probability
     density p(x(0))

2. Update using importance sampling:
   • Set the importance weights equal to: w^(k) = p(z(i)|x^(k))
   • Calculate the normalized importance weights:
     w_norm^(k) = w^(k) / Σ_k w^(k)

3. Resample by selection:
   • Calculate the cumulative weights w_cum^(k) = Σ_{j=1..k} w_norm^(j)
   • for k = 1,...,K:
     • Generate a random number r^(k) uniformly distributed in [0, 1]
     • Find the smallest j such that w_cum^(j) ≥ r^(k)
     • Set x_selected^(k) = x^(j)

4. Predict:
   • Set i = i + 1
   • for k = 1,...,K:
     • Draw sample x^(k) from the density p(x(i)|x(i - 1) = x_selected^(k))

5. Go to 2


After step 2, the posterior density is available in terms of the samples x^(k)
and the weights w_norm^(k). The MMSE estimate and the associated error
covariance matrix can be obtained from (4.75). For instance, the MMSE estimate
is obtained by substitution of g(x) = x. Since we have a representation of
the posterior density, estimates associated with other criteria can be 
obtained as well. 
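The four steps of Algorithm 4.4 fit in a short function for a scalar state. The sketch below is an illustration only: the callables `draw_prior`, `likelihood` and `draw_next`, the slowly drifting constant used as a test system, and the number of particles are all assumptions introduced here.

```python
import bisect
import itertools
import math
import random


def condensation(z_seq, draw_prior, likelihood, draw_next, K, rng):
    """Return the MMSE estimates of a scalar state, one per measurement.

    draw_prior(rng)      : draws a sample from p(x(0))
    likelihood(z, x)     : evaluates p(z(i)|x(i)) up to a constant factor
    draw_next(x, rng)    : draws a sample from p(x(i+1)|x(i) = x)
    """
    particles = [draw_prior(rng) for _ in range(K)]               # step 1
    estimates = []
    for z in z_seq:
        w = [likelihood(z, x) for x in particles]                 # step 2
        total = sum(w)
        w_norm = [wk / total for wk in w]
        # MMSE estimate: (4.75) with g(x) = x
        estimates.append(sum(wn * x for wn, x in zip(w_norm, particles)))
        w_cum = list(itertools.accumulate(w_norm))                # step 3
        w_cum[-1] = 1.0                                           # round-off guard
        selected = [particles[bisect.bisect_left(w_cum, 1.0 - rng.random())]
                    for _ in range(K)]
        particles = [draw_next(x, rng) for x in selected]         # step 4
    return estimates


rng = random.Random(0)
estimates = condensation(
    [3.0] * 30,                                                   # noise-free toy data
    draw_prior=lambda r: r.gauss(0.0, 5.0),
    likelihood=lambda z, x: math.exp(-0.5 * (z - x) ** 2),
    draw_next=lambda x, r: x + r.gauss(0.0, 0.05),
    K=2000,
    rng=rng,
)
print(estimates[-1])
```

The sketch assumes that at least one particle has a non-zero likelihood at every step; otherwise the normalization divides by zero, which is the degenerate situation discussed in the text.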

The calculation of the importance weights in step 2 involves the
conditional density p(z(i)|x^(k)). In the case of nonlinear measurement
functions of the type z(i) = h(x(i)) + v(i), it all boils down to calculating
the density of the measurement noise for v(i) = z(i) - h(x^(k)). For
instance, if v(i) is zero mean, Gaussian with covariance matrix C_v, the
weights are calculated as:

$$w^{(k)} = \text{constant} \times \exp\left(-\tfrac{1}{2}\,(z(i) - h(x^{(k)}))^{T} C_v^{-1} (z(i) - h(x^{(k)}))\right)$$

The actual value of constant is irrelevant because of the normalization.
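The Gaussian weight computation can be sketched as follows. To stay within the standard library the covariance matrix C_v is assumed to be diagonal, which is a simplification made here, and the exponent is shifted by its maximum before exponentiation; that shift only changes the irrelevant constant but avoids underflow when all exponents are strongly negative.

```python
import math


def gaussian_weights(z, particles, h, var_v):
    """Normalized importance weights for z = h(x) + v with v ~ N(0, diag(var_v)).

    z         : measurement vector (list of floats)
    particles : list of samples x^(k)
    h         : measurement function returning a list of the same length as z
    var_v     : list with the diagonal elements of C_v (assumed diagonal here)
    """
    log_w = []
    for x in particles:
        predicted = h(x)
        log_w.append(-0.5 * sum((zi - pi) ** 2 / vi
                                for zi, pi, vi in zip(z, predicted, var_v)))
    top = max(log_w)                         # shifting only rescales 'constant'
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    return [wk / total for wk in w]


weights = gaussian_weights([1.0], [0.0, 1.0, 5.0], lambda x: [x], [1.0])
print(weights)
```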

The drawing of new samples in the prediction step involves the state 
equation. If x(i + 1) = f(x(i), u(i)) + w(i), then the drawing is governed
by the density of w(i).

The advantages of particle filtering are obvious. Nonlinearities
of both the state equation and the measurement function are handled
smoothly without the necessity to calculate Jacobian matrices. However,
the method works well only if enough particles are used. Especially
for large dimensions of the state vector the required number
becomes large. If the number is not sufficient, then the particles are
not able to represent the density p(x(i)|Z(i - 1)). Particularly, if for
some values of x(i) the likelihood p(z(i)|x(i)) is very large, while on
these locations p(x(i)|Z(i - 1)) is small, the particle filtering may not
converge. It occurs frequently then that all weights become zero
except one which becomes unit. The particle filter is said to be
degenerated.


Example 4.10 Particle filtering applied to volume density estimation 
The problem of estimating the volume density of a substance mixed 
with a liquid is introduced in Example 4.1. The model, expressed in 
equation (4.3), is nonlinear and non-Gaussian. The state vector 
consists of two continuous variables (volume and density), and one 
discrete variable (the state of the on/off controller). The measure- 
ment system, expressed in equation (4.40), is nonlinear with additive 
Gaussian noise. Example 4.6 has shown that the EKF is able to estimate 
the density, but only by using an approximate model of the process 
in which the discrete state is removed. The price to pay for such a 
rough approximation is that the estimation error of the density and 
(particularly) the volume has a large magnitude. 

The particle filter does not need such a rough approximation 
because it can handle the discrete state variable. In addition, the 
particle filter can cope with discontinuities. These discontinuities 
appear here because of the discrete on/off control, but also because 
the input flow of the substance occurs in chunks at some random 
points in time.

The particle filter implemented in this example uses the process
model given in (4.3), and the measurement model of (4.40). The
parameters used are tabulated in Example 4.5. Other parameters
are: V_low = 3990 (litre) and V_high = 4010 (litre). The random points
of the substance are modelled as a Poisson process with mean time
between two points λ = 100Δ = 100 (s). The chunks have a uniform
distribution between 7 and 13 (litre). Results of the particle filter
using 10000 particles are shown in Figure 4.22. The figure shows an
example of a cloud of particles. Clearly, such a cloud is not
represented by a Gaussian distribution. In fact, the distribution is
multi-modal due to the uncertainty of the moment at which the on/off
control switches its state.

Figure 4.22 Application of particle filtering to the density estimation problem.
(a) Real states and measurements. (b) The particles obtained at i = 511. (c) Results

In contrast with the Kalman filter, the particle filter is able to 
estimate the fluctuations of the volume. In addition, the estimation 
of the density is much more accurate. The price to pay is the compu- 
tational cost. 


MATLAB functions for particle filtering 


Many MATLAB users have already implemented particle filters, but no 
formal toolbox yet exists. Section 9.3 contains a listing of MATLAB code 
that implements the condensation algorithm. Details of the implementation
are also given.


4.6 EXERCISES


1. Consider the following model for a random constant: 


   x(i + 1) = x(i)
   z(i) = x(i) + v(i)      v(i) is white noise with variance σ_v^2

The prior knowledge is E[x(0)] = 0 and σ^2_x(0) = ∞. Give the expression for the
solution of the discrete Lyapunov equation. (0).

2. For the random constant model given in exercise 1, give expressions for the innovation 
matrix S(i), the Kalman gain matrix K(i), the error covariance matrix C(i|i) and the 
prediction matrix C(i|i — 1) for the first few time steps. That is, for i= 0,1,2 and 3. 
Explain the results. (0). 

3. In exercise 2, can you find the expressions for arbitrary i? Can you also prove that these
expressions are correct? Hint: use induction. Explain the results. (*).


4. Consider the following time-invariant scalar linear-Gaussian system 


   x(i + 1) = ax(i) + w(i)      w(i) is white noise with variance σ_w^2
   z(i) = x(i) + v(i)           v(i) is white noise with variance σ_v^2

The prior knowledge is E[x(0)] = 0 and σ^2_x(0) = ∞. What is the condition for the
existence of the solution of the discrete Lyapunov equation? If this condition is 
met, give an expression for that solution. (0). 

5. For the system described in exercise 4, give the steady state solution. That is, give 
expressions for S(i), K(i), C(i|i) and C(i|i - 1) if i → ∞. (0).

6. For the system given in exercise 4, give expressions for S(i), K(i), C(i|i) and C(i|i — 1) 
for the first few time steps, that is, i = 0, 1, 2 and 3. Explain the results. (0).

7. In exercise 4, can you find the expressions for S(i), K(i), C(i|i) and C(i|i - 1) for
arbitrary i? (*).

8. Autoregressive models and MATLAB’s Control System Toolbox: consider the following
second order autoregressive (AR) model:








Using the functions tf() and ss() from the Control Toolbox, convert this AR model
into an equivalent state space model, that is, x(i + 1) = Fx(i) + Gw(i) and
z(i) = Hx(i) + v(i). (Use help tf and help ss to find out how these functions
should be used.) Assuming that w(i) and v(i) are white noise sequences with variances
σ_w^2 = 1 and σ_v^2 = 1, use the function dlyap() to find the solution of the discrete
Lyapunov equation, and the function kalman() (or dlqe()) to find the solution for
the steady state Kalman gain and corresponding error covariance matrices. Hint: the
output variable of the command ss() is a ‘struct’ whose fields are printed by typing
struct(ss). (*).


9. Moving average models: repeat exercise 8, but now considering the so-called first
order moving average (MA) model:


x(it+ 1) (w(i) +w(i-1)) 0% =1 
zli) =x(i)+v(i) of =1 








(*) 
10. Autoregressive, moving average models: repeat exercise 8, but now considering the
so-called ARMA(2, 1) model:


x(i+1) Txi) sali Dee waive iasi 








(*)
2(i) = x(i) + v(i) =1 


11. Simulate the processes mentioned in exercises 1, 8, 9 and 10, using MATLAB, and
apply the Kalman filters.


5 


Supervised Learning 


One method for the development of a classifier or an estimator is the 
so-called model-based approach. Here, the required conditional
probability densities and prior probabilities are obtained
by means of general knowledge of the physical process and the sensory
system in terms of mathematical models. The development of the
estimators for the backscattering coefficient, discussed in Chapter 3, follows
such an approach. 

In many other applications, modelling the process is very difficult if 
not impossible. For instance, in the mechanical parts application, 
discussed in Chapter 2, the visual appearance of the objects depends 
on many factors that are difficult to model. The alternative to the 
model-based approach is the learning from examples paradigm. Here, 
it is assumed that in a given application a population of objects is 
available. From this population, some objects are selected. These 
selected objects are called the samples. Each sample is presented to 
the sensory system which returns the measurement vector associated 
with that sample. The purpose of learning (or training) is to use these 
measurement vectors of the samples to build a classifier or an estima- 
tor. 

The problem of learning has two versions: supervised and unsuper- 
vised, that is, with or without knowing the true class/parameter of the 
sample. See Figure 5.1. This chapter addresses the first version. Chapter 7 
deals with unsupervised learning. 


Classification, Parameter Estimation and State Estimation: An Engineering Approach using MATLAB 
F. van der Heijden, R.P.W. Duin, D. de Ridder and D.M.J. Tax
© 2004 John Wiley & Sons, Ltd ISBN: 0-470-09013-8

Figure 5.1 Training sets. (a) Labelled. (b) Unlabelled


The chapter starts with a section on the representation of training sets. 
In Sections 5.2 and 5.3 two approaches to supervised learning are dis- 
cussed: parametric and nonparametric learning. Section 5.4 addresses 
the problem of how to evaluate a classifier empirically. The discussion 
here is restricted to classification problems only. However, many tech- 
niques that are useful for classification problems are also useful for 
estimation problems. Especially Section 5.2 (parametric learning) is 
useful for estimation problems too. 


5.1 TRAINING SETS 


The set of samples is usually called the training set (or: learning data or 
design set). The selection of samples should occur randomly from the 
population. In almost all cases it is assumed that the samples are i.i.d., 
independent and identically distributed. This means that all samples are 
selected from the same population of objects (in the simplest case, with 
equal probability). Furthermore, the probability of one member of the 
population being selected is not allowed to depend on the selection of 
other members of the population. 

Figure 5.1 shows scatter diagrams of the mechanical parts application 
of Chapter 2. In Figure 5.1(a) the samples are provided with a label 
carrying the information of the true class of the corresponding object. 
There are several methods to find the true class of a sample, e.g. manual
inspection, additional measurements, destructive analysis, etc. Often,
these methods are expensive and therefore allowed only if the number of
samples is not too large.

The number of samples in the training set is denoted by $N_S$. The
samples are enumerated by the symbol $n = 1, \ldots, N_S$. Object $n$ has a
measurement vector $\mathbf{z}_n$. The true class of the $n$-th object is denoted by
$\theta_n \in \Omega$. Then, a labelled training set $T_S$ contains samples
$(\mathbf{z}_n, \theta_n)$, each one consisting of a measurement vector and its true class:


$T_S = \{ (\mathbf{z}_n, \theta_n) \}$ with $n = 1, \ldots, N_S$ (5.1)


Another representation of the data set is obtained if we split the training
set according to their true classes:


$T_S = \{ \mathbf{z}_{k,n} \}$ with $k = 1, \ldots, K$ and $n = 1, \ldots, N_k$ (5.2)


where $N_k$ is the number of samples with class $\omega_k$ and $K = |\Omega|$ is the
number of classes. A representation equivalent to (5.2) is to introduce a
set $T_k$ for each class.


$T_k = \{ (\mathbf{z}_n, \theta_n) \mid \theta_n = \omega_k \}$ with $k = 1, \ldots, K$ and $n = 1, \ldots, N_k$ (5.3)


It is understood that the numberings of samples used in these three
representations do not coincide. Since the representations are equivalent,
we have:


$N_S = \sum_{k=1}^{K} N_k$ (5.4)


In PRTools, data sets are always represented as in (5.1). In order to 
obtain representations as in (5.3), separate data sets for each of the 
classes will have to be constructed. In Listing 5.1 these two ways are 
shown. It is assumed that dat is an N x d matrix containing the meas- 
urements, and lab an N x 1 matrix containing the class labels. 


Listing 5.1
Two methods of representing data sets in PRTools. The first method is
used almost exclusively.


% Create a standard MATLAB dataset from data and labels.
% Method (5.1):
dat = [0.1 0.9; 0.3 0.95; 0.2 0.7];
lab = {'class 1', 'class 2', 'class 3'};
z = dataset(dat, lab);

% Method (5.3):
[nlab, lablist] = getnlab(z);    % Extract the numeric labels
[m, k, c] = getsize(z);          % Extract number of classes
for i = 1:c
  T{i} = seldat(z, i);
end;

5.2 PARAMETRIC LEARNING 


The basic assumption in parametric learning is that the only unknown 
factors are parameters of the probability densities involved. Thus, 
learning from samples boils down to finding the suitable values of these 
parameters. The process is analogous to parameter estimation discussed 
in Chapter 3. The difference is that the parameters in Chapter 3 
describe a physical process whereas the parameters discussed here are 
parameters of the probability densities of the measurements of the 
objects. Moreover, in parametric learning a set of many measurement 
vectors is available rather than just a single vector. Despite these two 
differences, the concepts from Chapter 3 are fully applicable to the 
current chapter. 

Suppose that $\mathbf{z}_n$ are the samples coming from the same class $\omega_k$. These
samples are repeated realizations of a single random vector $\mathbf{z}$. An
alternative view is to associate the samples with single realizations coming
from a set of random vectors with identical probability densities. Thus, a
training set $T_k$ consists of $N_k$ mutually independent random vectors $\mathbf{z}_n$.
The joint probability density of these vectors is


$p(\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_{N_k} \mid \omega_k, \boldsymbol{\alpha}_k) = \prod_{n=1}^{N_k} p(\mathbf{z}_n \mid \omega_k, \boldsymbol{\alpha}_k)$ (5.5)


$\boldsymbol{\alpha}_k$ is the unknown parameter vector of the conditional probability
density $p(\mathbf{z} \mid \omega_k, \boldsymbol{\alpha}_k)$. Since in parametric learning we assume that the form
of $p(\mathbf{z} \mid \omega_k, \boldsymbol{\alpha}_k)$ is known (only the parameter vector $\boldsymbol{\alpha}_k$ is unknown),
the complete machinery of Bayesian estimation (minimum risk, MMSE
estimation, MAP estimation, ML estimation) becomes available to find
estimators for the parameter vector $\boldsymbol{\alpha}_k$: see Section 3.1. Known concepts
to evaluate these estimators (bias and variance) also apply. The next
subsections discuss some special cases for the probability densities.


5.2.1 Gaussian distribution, mean unknown


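Let us assume that under class $\omega_k$ the measurement vector $\mathbf{z}$ is a Gaussian
random vector with known covariance matrix $\mathbf{C}_k$ and unknown
expectation vector $\boldsymbol{\mu}_k$. No prior knowledge is available concerning this
unknown vector. The purpose is to find an estimator for $\boldsymbol{\mu}_k$.

Since no prior knowledge about $\boldsymbol{\mu}_k$ is assumed, a maximum likelihood
estimator seems appropriate (Section 3.1.4). Substitution of (5.5) in (3.22)
gives the following general expression of a maximum likelihood estimator:


$\hat{\boldsymbol{\mu}}_k = \arg\max_{\boldsymbol{\mu}} \left\{ \prod_{n=1}^{N_k} p(\mathbf{z}_n \mid \omega_k, \boldsymbol{\mu}) \right\}$
$\quad\;\; = \arg\max_{\boldsymbol{\mu}} \left\{ \sum_{n=1}^{N_k} \ln p(\mathbf{z}_n \mid \omega_k, \boldsymbol{\mu}) \right\}$ (5.6)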


The logarithms introduced in the last line transform the product into 
a summation. This is only a technical matter which facilitates the 
maximization. 

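Knowing that $\mathbf{z}$ is Gaussian, the likelihood of $\boldsymbol{\mu}_k$ from a single
observation $\mathbf{z}_n$ is:


$p(\mathbf{z}_n \mid \omega_k, \boldsymbol{\mu}_k) = \frac{1}{\sqrt{(2\pi)^N |\mathbf{C}_k|}} \exp\left( -\tfrac{1}{2} (\mathbf{z}_n - \boldsymbol{\mu}_k)^T \mathbf{C}_k^{-1} (\mathbf{z}_n - \boldsymbol{\mu}_k) \right)$ (5.7)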


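Upon substitution of (5.7) in (5.6), rearrangement of terms and
elimination of irrelevant terms, we have:


$\hat{\boldsymbol{\mu}}_k = \arg\min_{\boldsymbol{\mu}} \left\{ \sum_{n=1}^{N_k} (\mathbf{z}_n - \boldsymbol{\mu})^T \mathbf{C}_k^{-1} (\mathbf{z}_n - \boldsymbol{\mu}) \right\}$
$\quad\;\; = \arg\min_{\boldsymbol{\mu}} \left\{ \sum_{n=1}^{N_k} \mathbf{z}_n^T \mathbf{C}_k^{-1} \mathbf{z}_n + \sum_{n=1}^{N_k} \boldsymbol{\mu}^T \mathbf{C}_k^{-1} \boldsymbol{\mu} - 2 \sum_{n=1}^{N_k} \mathbf{z}_n^T \mathbf{C}_k^{-1} \boldsymbol{\mu} \right\}$ (5.8)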


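Differentiating the expression between braces with respect to $\boldsymbol{\mu}$
(Appendix B.4) and equating the result to zero yields the average or sample
mean calculated over the training set:


$\hat{\boldsymbol{\mu}}_k = \frac{1}{N_k} \sum_{n=1}^{N_k} \mathbf{z}_n$ (5.9)


Being a sum of Gaussian random variables, this estimate has a Gaussian
distribution too. The expectation of the estimate is:


$\mathrm{E}[\hat{\boldsymbol{\mu}}_k] = \frac{1}{N_k} \sum_{n=1}^{N_k} \mathrm{E}[\mathbf{z}_n] = \frac{1}{N_k} \sum_{n=1}^{N_k} \boldsymbol{\mu}_k = \boldsymbol{\mu}_k$ (5.10)


where $\boldsymbol{\mu}_k$ is the true expectation of $\mathbf{z}$. Hence, the estimation is unbiased.
The covariance matrix of the estimation error is found as:


$\mathbf{C}_{\hat{\boldsymbol{\mu}}_k} = \mathrm{E}\left[ (\hat{\boldsymbol{\mu}}_k - \boldsymbol{\mu}_k)(\hat{\boldsymbol{\mu}}_k - \boldsymbol{\mu}_k)^T \right] = \frac{1}{N_k} \mathbf{C}_k$ (5.11)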


The proof is left as an exercise for the reader. 
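Without giving the proof, the statements (5.10) and (5.11) can be checked numerically. The following Python sketch is an illustration under simplifying assumptions: it treats only the scalar case (so that (5.11) reduces to a variance of $\sigma^2/N_k$), and the function names, the seed and the parameter values are chosen freely for this sketch.

```python
import random
import statistics


def sample_mean(values):
    """Sample mean as in (5.9), scalar case."""
    return sum(values) / len(values)


def simulate_sample_means(mu, sigma, n_samples, n_trials, seed=0):
    """Draw n_trials training sets of n_samples Gaussian samples each
    and return the sample mean of every training set."""
    rng = random.Random(seed)
    return [
        sample_mean([rng.gauss(mu, sigma) for _ in range(n_samples)])
        for _ in range(n_trials)
    ]


def check_sample_mean(mu=3.0, sigma=2.0, n_samples=25, n_trials=2000):
    """Return (empirical bias, empirical variance, variance predicted by (5.11))."""
    means = simulate_sample_means(mu, sigma, n_samples, n_trials)
    bias = statistics.mean(means) - mu
    variance = statistics.pvariance(means)
    predicted = sigma ** 2 / n_samples
    return bias, variance, predicted
```

Calling `check_sample_mean()` returns a bias close to zero and an empirical variance close to the predicted value; the agreement improves as the number of simulated training sets grows.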


5.2.2 Gaussian distribution, covariance matrix unknown 


Next, we consider the case where under class $\omega_k$ the measurement vector
$\mathbf{z}$ is a Gaussian random vector with unknown covariance matrix $\mathbf{C}_k$. For
the moment we assume that the expectation vector $\boldsymbol{\mu}_k$ is known. No
prior knowledge is available. The purpose is to find an estimator for $\mathbf{C}_k$.
The maximum likelihood estimate follows from (5.5) and (5.7):


$\hat{\mathbf{C}}_k = \arg\max_{\mathbf{C}} \left\{ \sum_{n=1}^{N_k} \ln p(\mathbf{z}_n \mid \omega_k, \mathbf{C}) \right\} = \frac{1}{N_k} \sum_{n=1}^{N_k} (\mathbf{z}_n - \boldsymbol{\mu}_k)(\mathbf{z}_n - \boldsymbol{\mu}_k)^T$ (5.12)


The last step in (5.12) is non-trivial. The proof is rather technical and
will be omitted. However, the result is plausible since the estimate is the
average of the $N_k$ matrices $(\mathbf{z}_n - \boldsymbol{\mu}_k)(\mathbf{z}_n - \boldsymbol{\mu}_k)^T$ whereas the true
covariance matrix is the expectation of $(\mathbf{z} - \boldsymbol{\mu}_k)(\mathbf{z} - \boldsymbol{\mu}_k)^T$.

The probability distribution of the random variables in $\hat{\mathbf{C}}_k$ is a Wishart
distribution. The estimator is unbiased. The variances of the elements of
$\hat{\mathbf{C}}_k$ are:


$\mathrm{Var}[\hat{C}_{k_{i,j}}] = \frac{1}{N_k} \left( C_{k_{i,i}} C_{k_{j,j}} + C_{k_{i,j}}^2 \right)$ (5.13)



5.2.3 Gaussian distribution, mean and covariance matrix 
both unknown 


If both the expectation vector and the covariance matrix are unknown,
the estimation problem becomes more complicated because then we
have to estimate the expectation vector and covariance matrix
simultaneously. It can be deduced that the following estimators for $\mathbf{C}_k$ and $\boldsymbol{\mu}_k$ are
unbiased:


$\hat{\boldsymbol{\mu}}_k = \frac{1}{N_k} \sum_{n=1}^{N_k} \mathbf{z}_n$
$\hat{\mathbf{C}}_k = \frac{1}{N_k - 1} \sum_{n=1}^{N_k} (\mathbf{z}_n - \hat{\boldsymbol{\mu}}_k)(\mathbf{z}_n - \hat{\boldsymbol{\mu}}_k)^T$ (5.14)


$\hat{\mathbf{C}}_k$ is called the sample covariance. Comparing (5.12) with (5.14) we
note two differences. In the latter expression the unknown expectation
has been replaced with the sample mean. Furthermore, the divisor $N_k$
has been replaced with $N_k - 1$. Apparently, the lack of knowledge of $\boldsymbol{\mu}_k$
in (5.14) makes it necessary to sacrifice one degree of freedom in the
averaging operator. For large $N_k$, the difference between (5.12) and
(5.14) vanishes.
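As an illustration, the sample mean and the sample covariance with divisor $N_k - 1$ can be computed with a few lines of plain Python. This is a minimal sketch, not PRTools code; the function name `sample_mean_and_cov` and the list-of-tuples data layout are assumptions made here.

```python
def sample_mean_and_cov(samples):
    """Return the sample mean and the sample covariance (divisor N_k - 1)
    of a list of equally long measurement vectors."""
    n_k = len(samples)
    if n_k < 2:
        raise ValueError("at least two samples are needed")
    dim = len(samples[0])
    # Sample mean: average over the training set.
    mean = [sum(s[d] for s in samples) / n_k for d in range(dim)]
    # Sample covariance: average outer product of deviations from the
    # sample mean, with one degree of freedom sacrificed.
    cov = [
        [
            sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (n_k - 1)
            for j in range(dim)
        ]
        for i in range(dim)
    ]
    return mean, cov
```

For the three samples (1, 2), (3, 6) and (5, 10) the sample mean is (3, 6). The second element is always twice the first, so the resulting covariance matrix is singular, which already hints at the invertibility problem that arises with few samples.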

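In classification problems, often the inverse $\mathbf{C}_k^{-1}$ is needed, for
instance, in quantities like $\mathbf{z}^T \mathbf{C}_k^{-1} \boldsymbol{\mu}$, $\mathbf{z}^T \mathbf{C}_k^{-1} \mathbf{z}$, etc. Often, $\hat{\mathbf{C}}_k^{-1}$ is used as
an estimate of $\mathbf{C}_k^{-1}$. To determine the number of samples required such
that $\hat{\mathbf{C}}_k^{-1}$ becomes an accurate estimate of $\mathbf{C}_k^{-1}$, the variance of $\hat{\mathbf{C}}_k$, given
in (5.13), is not very helpful. To see this it is instructive to rewrite the
inverse as (see Appendix B.5 and C.3.2):


$\hat{\mathbf{C}}_k^{-1} = \hat{\mathbf{V}}_k \hat{\boldsymbol{\Lambda}}_k^{-1} \hat{\mathbf{V}}_k^T$ (5.15)


where $\hat{\boldsymbol{\Lambda}}_k$ is a diagonal matrix containing the eigenvalues of $\hat{\mathbf{C}}_k$, and
$\hat{\mathbf{V}}_k$ is the matrix containing the corresponding eigenvectors. Clearly,
the behaviour of $\hat{\mathbf{C}}_k^{-1}$ is strongly affected by small eigenvalues in $\hat{\boldsymbol{\Lambda}}_k$.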
In fact, the number of nonzero eigenvalues in the estimate $\hat{\mathbf{C}}_k$ given in
(5.14) cannot exceed $N_k - 1$. If $N_k - 1$ is smaller than the dimension
$N$ of the measurement vector, the estimate $\hat{\mathbf{C}}_k$ is not invertible.
Therefore, we must require that $N_k$ is (much) larger than $N$. As a rule of
thumb, the number of samples must be chosen such that at least
$N_k > 5N$.

In order to reduce the sensitivity to statistical errors, we might also
want to regularize the inverse operation. Suppose that $\hat{\boldsymbol{\Lambda}}$ is the diagonal
matrix containing the eigenvalues of the estimated covariance matrix $\hat{\mathbf{C}}$
(we conveniently drop the index $k$ for a moment). $\hat{\mathbf{V}}$ is the matrix
containing the eigenvectors. Then, we can define a regularized inverse
operation as follows:


$\hat{\mathbf{C}}^{-1}_{\mathrm{regularized}} = \hat{\mathbf{V}} \left( (1 - \gamma) \hat{\boldsymbol{\Lambda}} + \gamma \frac{\mathrm{trace}(\hat{\boldsymbol{\Lambda}})}{N} \mathbf{I} \right)^{-1} \hat{\mathbf{V}}^T, \quad 0 \le \gamma \le 1$ (5.16)


where $\mathrm{trace}(\hat{\boldsymbol{\Lambda}})/N$ is the average of the eigenvalues of $\hat{\mathbf{C}}$. $\gamma$ is a
regularization parameter. The effect is that the influence of the smallest
eigenvalues is tamed. A simpler implementation of (5.16) is (see exercise 2):


$\hat{\mathbf{C}}^{-1}_{\mathrm{regularized}} = \left( (1 - \gamma) \hat{\mathbf{C}} + \gamma \frac{\mathrm{trace}(\hat{\mathbf{C}})}{N} \mathbf{I} \right)^{-1}$ (5.17)


Another method to regularize a covariance matrix estimate is by sup- 
pressing all off-diagonal elements to some extent. This is achieved by 
multiplying these elements by a factor that is selected between 0 and 1. 
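Both regularization methods are easily sketched in plain Python. The fragment below is an illustration only: it computes the regularized matrix that is subsequently inverted (the blend of the estimate with a scaled identity matrix, and the suppression of off-diagonal elements). The function names are assumptions made for this sketch, and the determinant helper handles the 2 x 2 case only.

```python
def regularize_covariance(cov, gamma):
    """Blend a covariance estimate with (trace(C)/N) * I using weight gamma."""
    n = len(cov)
    avg_eigenvalue = sum(cov[i][i] for i in range(n)) / n  # trace(C)/N
    return [
        [
            (1.0 - gamma) * cov[i][j] + (gamma * avg_eigenvalue if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]


def shrink_off_diagonal(cov, factor):
    """Multiply all off-diagonal elements by a factor between 0 and 1."""
    n = len(cov)
    return [
        [cov[i][j] if i == j else factor * cov[i][j] for j in range(n)]
        for i in range(n)
    ]


def det2(m):
    """Determinant of a 2 x 2 matrix (used to check invertibility)."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]
```

For the singular estimate [[4, 2], [2, 1]], a regularization parameter of 0.5 yields [[3.25, 1], [1, 1.75]], which has a positive determinant and is therefore invertible. With the parameter set to 1 the result is a scaled identity matrix, in which case the Mahalanobis distance reduces to a scaled Euclidean distance.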


Example 5.1 Classification of mechanical parts, Gaussian 
assumption 

We now return to the example in Chapter 2 where mechanical parts
like nuts and bolts, etc. must be classified in order to sort them. See
Figure 2.2. In Listing 5.2 the PRTools procedure for training and
visualizing the classifiers is given. Two classifiers are trained: a linear
classifier (ldc) and a quadratic classifier (qdc). The trained classifiers
are stored in w_l and w_q, respectively. Using the plotc function,
the decision boundaries can be plotted. In principle this visualization
is only possible in 2D. For data with more than two measurements,
the classifiers cannot be visualized.


Listing 5.2

PRTools code for training and plotting linear and quadratic
discriminants under assumption of normal distributions of the conditional
probability densities.


load nutsbolts;               % Load the mechanical parts dataset
w_l=ldc(z,0,0.7);             % Train a linear classifier on z
w_q=qdc(z,0,0.5);             % Train a quadratic classifier on z
figure; scatterd(z);          % Show scatter diagram of z
plotc(w_l);                   % Plot the first classifier
plotc(w_q,':');               % Plot the second classifier
[0.4 0.2]*w_l*labeld          % Classify a new object with z=[0.4 0.2]

Figure 5.2 shows the decision boundaries obtained from the data shown 
in Figure 5.1(a) assuming Gaussian distributions for each class. The 
discriminant in Figure 5.2(a) assumes that the covariance matrices for 
different classes are the same. This assumption yields a Mahalanobis 
distance classifier. The effect of the regularization is that the classifier 
tends to approach the Euclidean distance classifier. Figure 5.2(b) 
assumes unequal covariance matrices. The effect of the regularization 
here is that the decision boundaries tend to approach circle segments. 


5.2.4 Estimation of the prior probabilities 


The prior probability of a class is denoted by $P(\omega_k)$. There are exactly
$K$ classes. Having a labelled training set with $N_S$ samples (randomly
selected from a population), the number $N_k$ of samples with class $\omega_k$
has a so-called multinomial distribution. If $K = 2$ the distribution is
binomial. See Appendix C.1.3.

The multinomial distribution is fully defined by $K$ parameters. In
addition to the $K - 1$ parameters that are necessary to define the prior
probabilities, we need to specify one extra parameter, $N_S$, which is the
number of samples.

Figure 5.2 Classification assuming Gaussian distributions. (a) Linear decision
boundaries. (b) Quadratic decision boundaries. (Horizontal axes: measure of
six-fold rotational symmetry.)

Intuitively, the following estimator is appropriate:


$\hat{P}(\omega_k) = \frac{N_k}{N_S}$ (5.18)


The expectation of $N_k$ equals $N_S P(\omega_k)$. Therefore, $\hat{P}(\omega_k)$ is an unbiased
estimate of $P(\omega_k)$. The variance of a multinomial distributed variable is
$N_S P(\omega_k)(1 - P(\omega_k))$. Consequently, the variance of the estimate is:


$\mathrm{Var}[\hat{P}(\omega_k)] = \frac{P(\omega_k)(1 - P(\omega_k))}{N_S}$ (5.19)


This shows that the estimator is consistent. That is, if $N_S \to \infty$, then
$\mathrm{Var}[\hat{P}(\omega_k)] \to 0$. The required number of samples follows from the
constraint that $\sqrt{\mathrm{Var}[\hat{P}(\omega_k)]} \ll P(\omega_k)$. For instance, if for some class
we anticipate that $P(\omega_k) = 0.01$, and the permitted relative error is 20%,
i.e. $\sqrt{\mathrm{Var}[\hat{P}(\omega_k)]} = 0.2\,P(\omega_k)$, then $N_S$ must be about 2500 in order to
obtain the required precision.
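The sample size in this example follows from solving the variance expression (5.19) for $N_S$, which gives $N_S = (1 - P(\omega_k)) / (r^2 P(\omega_k))$ for a relative error $r$. A small Python sketch of this arithmetic is given below; the function names are chosen here for illustration only.

```python
import math


def prior_estimate_std(p, n_s):
    """Standard deviation of the estimated prior, from (5.19)."""
    return math.sqrt(p * (1.0 - p) / n_s)


def required_samples(p, relative_error):
    """Smallest N_S (approximately) for which the standard deviation of
    the estimated prior equals relative_error * p."""
    return math.ceil((1.0 - p) / (relative_error ** 2 * p))
```

With a prior of 0.01 and a relative error of 0.2, `required_samples` returns a value of roughly 2475, in agreement with the figure of about 2500 quoted above.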


5.2.5 Binary measurements 


Another example of a multinomial distribution occurs when the
measurement vector $\mathbf{z}$ can only take a finite number of states. For instance,
if the sensory system is such that each element in the measurement
vector is binary, i.e. either '1' or '0', then the number of states the
vector can take is at most $2^N$. Such a binary vector can be replaced
with an equivalent scalar $z$ that only takes integer values from 1 up to
$2^N$. The conditional probability density $p(\mathbf{z}|\omega_k)$ turns into a probability
function $P(z|\omega_k)$. Let $N_k(z)$ be the number of samples in the training
set with measurement $z$ and class $\omega_k$. $N_k(z)$ has a multinomial
distribution.

At first sight, one would think that estimating $P(z|\omega_k)$ is the same type
of problem as estimating the prior probabilities such as discussed in the
previous section:


$\hat{P}(z|\omega_k) = \frac{N_k(z)}{N_k}$ (5.20)

$\mathrm{Var}[\hat{P}(z|\omega_k)] = \frac{P(z|\omega_k)(1 - P(z|\omega_k))}{N_k}$ (5.21)


For small $N$ and a large training set, this estimator indeed suffices.
However, if $N$ is too large, the estimator fails. A small example
demonstrates this. Suppose the dimension of the vector is $N = 10$. Then the
total number of states is $2^{10} \approx 10^3$. Therefore, some states will have a
probability of less than $10^{-3}$. The uncertainty of the estimated
probabilities must be a fraction of that, say $10^{-4}$. The number of samples, $N_k$,
needed to guarantee such a precision is on the order of $10^5$ or more.
Needless to say that in many applications $10^5$ samples is much too
expensive. Moreover, with even a slight increase of $N$ the required
number of samples becomes much larger.

One way to avoid a large variance is to incorporate more prior
knowledge. For instance, without the availability of a training set, it is known
beforehand that all parameters are bounded by $0 \le P(z|\omega_k) \le 1$. If
nothing further is known, we could first 'guess' that all states are equally
likely: $P(z|\omega_k) = 2^{-N}$. Based on this guess, the estimator takes the form:


$\hat{P}(z|\omega_k) = \frac{N_k(z) + 1}{N_k + 2^N}$ (5.22)


The variance of the estimate is:


$\mathrm{Var}[\hat{P}(z|\omega_k)] = \frac{N_k P(z|\omega_k)(1 - P(z|\omega_k))}{(N_k + 2^N)^2}$ (5.23)


Comparing (5.22) and (5.23) with (5.20) and (5.21) the conclusion is
that the variance of the estimate is reduced at the cost of a small bias. See
also exercise 4.
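The two estimators for binary measurements can be compared with a few lines of Python. The sketch below is illustrative only: the function names are assumptions, `n_bits` stands for the dimension $N$ of the binary measurement vector, and `count` stands for $N_k(z)$.

```python
def ml_estimate(count, n_k):
    """Relative-frequency estimate: N_k(z) / N_k."""
    return count / n_k


def ml_variance(p, n_k):
    """Variance of the relative-frequency estimate."""
    return p * (1.0 - p) / n_k


def smoothed_estimate(count, n_k, n_bits):
    """Estimate based on the equal-likelihood guess: (N_k(z) + 1) / (N_k + 2^N)."""
    return (count + 1) / (n_k + 2 ** n_bits)


def smoothed_variance(p, n_k, n_bits):
    """Variance of the smoothed estimate."""
    return n_k * p * (1.0 - p) / (n_k + 2 ** n_bits) ** 2
```

A state that never occurs in the training set gets probability zero from the first estimator but a small positive probability from the second one. The smoothed estimates still sum to one over all $2^N$ states, because the added ones in the numerators sum to the $2^N$ added in the denominator.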


5.3. NONPARAMETRIC LEARNING 


Nonparametric methods are learning methods for which prior knowledge
about the functional form of the conditional probability distributions is
not available or is not used explicitly. The name may suggest that no
parameters are involved. However, in fact these methods often require
more parameters than parametric methods. The difference is that in
nonparametric methods the parameters are not the parameters of the
conditional distributions.

At first sight, nonparametric learning seems to be more difficult than
parametric learning because nonparametric methods exploit less
knowledge. For some nonparametric methods this is indeed the case. In
principle, these types of nonparametric methods can handle arbitrary
types of conditional distributions. Their generality is high. The downside
of this advantage is that large to very large training sets are needed to
compensate for the lack of knowledge about the densities.

Other nonparametric learning methods cannot handle arbitrary types 
of conditional distributions. The classifiers being trained are constrained 
to some preset computational structure of their decision function. By 
this, the corresponding decision boundaries are also constrained. An 
example is the linear classifier already mentioned in Section 2.1.2. Here, 
the decision boundaries are linear (hyper)planes. The advantage of 
incorporating constraints is that fewer samples in the training set are 
needed. The stronger the constraints are, the fewer samples are needed. 
However, good classifiers can only be obtained if the constraints that are 
used match the type of the underlying problem-specific distributions. 
Hence, in constraining the computational structure of the classifier 
implicit knowledge of the distribution is needed. 


5.3.1 Parzen estimation and histogramming 


The objective of Parzen estimation and histogramming is to obtain
estimates of the conditional probability densities. This is done without
much prior knowledge of these densities. As before, the estimation is
based on a labelled training set $T_S$. We use the representation according
to (5.3), i.e. we split the training set into $K$ subsets $T_k$, each having $N_k$
samples all belonging to class $\omega_k$. The goal is to estimate the conditional
density $p(\mathbf{z}|\omega_k)$ for arbitrary $\mathbf{z}$.

A simple way to reach the goal is to partition the measurement space
into a finite number of disjoint regions $R_i$, called bins, and to count the
number of samples that falls in each of these bins. The estimated
probability density within a bin is proportional to that count. This technique
is called histogramming. Suppose that $N_{k,i}$ is the number of samples
with class $\omega_k$ that fall within the $i$-th bin. Then the probability density
within the $i$-th bin is estimated as:


$\hat{p}(\mathbf{z}|\omega_k) = \frac{N_{k,i}}{\mathrm{Volume}(R_i) \cdot N_k}$ with $\mathbf{z} \in R_i$ (5.24)


For each class, the number $N_{k,i}$ has a multinomial distribution with
parameters


$P_{k,i} = \int_{\mathbf{z} \in R_i} p(\mathbf{z}|\omega_k)\, d\mathbf{z}$ with $i = 1, \ldots, N_{bin}$


where $N_{bin}$ is the number of bins. The statistical properties of $\hat{p}(\mathbf{z}|\omega_k)$
follow from arguments that are identical to those used in Section 5.2.5.
In fact, if we quantize the measurement vector to, for instance, the
nearest centre of gravity of the bins, we end up in a situation similar to
the one of Section 5.2.5. The conclusion is that histogramming works
fine if the number of samples within each bin is sufficiently large. With a
given size of the training set, the size of the bins must be large enough to
assure a minimum number of samples per bin. Hence, with a small
training set, or a large dimension of the measurement space, the
resolution of the estimation will be very poor.
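For a one-dimensional measurement space with equally wide bins, histogramming can be sketched in Python as follows. This is an illustration under assumptions: the function name, the interval bounds `lo` and `hi`, and the choice to ignore samples outside the interval (while still counting them in $N_k$) are made for this sketch only.

```python
def histogram_density(samples, lo, hi, n_bins):
    """Estimate a 1D density on [lo, hi) with n_bins equally wide bins.

    Returns one density value per bin: count / (bin width * N_k).
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for z in samples:
        if lo <= z < hi:
            # Guard against rounding that would push z into bin n_bins.
            index = min(int((z - lo) / width), n_bins - 1)
            counts[index] += 1
    n_k = len(samples)
    return [count / (width * n_k) for count in counts]
```

With the samples 0.1, 0.2, 0.3 and 0.9 on the interval [0, 1) and two bins, the estimated density is 1.5 in the first bin and 0.5 in the second. Multiplying each value by the bin width and summing gives one, as it should for a density when all samples fall inside the interval.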

Parzen estimation can be considered as a refinement of
histogramming. The first step in the development of the estimator is to consider
only one sample from the training set. Suppose that $\mathbf{z}_j \in T_k$. Then, we
are certain that at this position in the measurement space the density is
nonzero, i.e. $p(\mathbf{z}_j|\omega_k) \neq 0$. Under the assumption that $p(\mathbf{z}|\omega_k)$ is
continuous over the entire measurement space it follows that in a small
neighbourhood of $\mathbf{z}_j$ the density is likely to be nonzero too. However,
the further we move away from $\mathbf{z}_j$, the less we can say about $p(\mathbf{z}|\omega_k)$. The
basic idea behind Parzen estimation is that the knowledge gained by the
observation of $\mathbf{z}_j$ is represented by a function positioned at $\mathbf{z}_j$ and with an
influence restricted to a small vicinity of $\mathbf{z}_j$. Such a function is called the
kernel of the estimator. It represents the contribution of $\mathbf{z}_j$ to the
estimate. Summing together the contributions of all vectors in the training
set yields the final estimate.

Let $\rho(\mathbf{z}, \mathbf{z}_j)$ be a distance measure (Appendix A.2) defined in the
measurement space. The knowledge gained by the observation $\mathbf{z}_j \in T_k$ is
represented by the kernel $h(\rho(\mathbf{z}, \mathbf{z}_j))$ where $h(\cdot)$ is a function $\mathbb{R}^+ \to \mathbb{R}^+$
such that $h(\rho(\mathbf{z}, \mathbf{z}_j))$ has its maximum at $\mathbf{z} = \mathbf{z}_j$, i.e. at $\rho(\mathbf{z}, \mathbf{z}_j) = 0$.
Furthermore, $h(\rho(\cdot,\cdot))$ must be monotonically decreasing as $\rho(\cdot,\cdot)$ increases,
and $h(\rho(\cdot,\cdot))$ must be normalized to one, i.e. $\int h(\rho(\mathbf{z}, \mathbf{z}_j))\, d\mathbf{z} = 1$ where
the integration extends over the entire measurement space.

The contribution of a single observation $\mathbf{z}_j$ is $h(\rho(\mathbf{z}, \mathbf{z}_j))$. The
contributions of all observations are summed to yield the final Parzen estimate:


$\hat{p}(\mathbf{z}|\omega_k) = \frac{1}{N_k} \sum_{\mathbf{z}_j \in T_k} h(\rho(\mathbf{z}, \mathbf{z}_j))$ (5.25)


The kernel $h(\rho(\cdot,\cdot))$ can be regarded as an interpolation function that
interpolates between the samples of the training set.

Figure 5.3 gives an example of Parzen estimation in a one-dimensional
measurement space. The plot is generated by the code in Listing 5.3. The
true distribution is zero for negative $z$ and has a peak value near $z = 1$
after which it slowly decays to zero. Fifty samples are available (shown
at the bottom of the figure). The interpolation function chosen is a
Gaussian function with width $\sigma_h$. The distance measure is Euclidean.
Figure 5.3(a) and Figure 5.3(b) show the estimations using $\sigma_h = 1$ and
$\sigma_h = 0.2$, respectively. These graphs illustrate a phenomenon related to
the choice of the interpolation function. If the interpolation function is
peaked, the influence of a sample is very local, and the variance of the
estimator is large. But if the interpolation is smooth, the variance
decreases, but at the same time the estimate becomes a smoothed version
of the true density. That is, the estimate becomes biased. By changing the
width of the interpolation function one can balance between the bias and
the variance. Of course, both types of errors can be reduced by
enlargement of the training set.

Figure 5.3 Parzen estimation of a density function using 50 samples. (a) $\sigma_h = 1$.
(b) $\sigma_h = 0.2$
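The one-dimensional Parzen estimate with a Gaussian interpolation function and Euclidean distance can be sketched in plain Python. This is a minimal illustration, not the PRTools implementation; the function names and the argument name `sigma_h` for the kernel width are chosen here.

```python
import math


def gaussian_kernel(distance, sigma_h):
    """1D Gaussian interpolation function of width sigma_h, normalized to one."""
    return math.exp(-0.5 * (distance / sigma_h) ** 2) / (
        sigma_h * math.sqrt(2.0 * math.pi)
    )


def parzen_density(z, training, sigma_h):
    """Parzen estimate at z: the average of the kernels centred
    at the training samples."""
    return sum(gaussian_kernel(abs(z - z_j), sigma_h) for z_j in training) / len(
        training
    )
```

Evaluating `parzen_density` on a fine grid for a small and a large `sigma_h` reproduces the behaviour described above: a narrow kernel gives a spiky estimate that follows the individual samples, whereas a wide kernel gives a smooth but flattened estimate.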


Listing 5.3

The following PRTools listing generates plots similar to those in Figure
5.3. It generates 50 samples from a $\Gamma(a, b)$ distribution with $a = 2$ and
$b = 1.5$, and then estimates the density using the Parzen method with
two different kernel widths.


n=50; a=2; b=1.5;
x=(-2:0.01:10)'; y=gampdf(x,a,b);         % Generate function
z=dataset(gamrnd(a,b,n,1),genlab(n));     % Generate dataset
w=parzenm(z,1);                           % Parzen, sigma=1
figure; scatterd(z); axis([-2 10 0 0.3]);
plotm(w,1); hold on; plot(x,y,':');
w=parzenm(z,0.2);                         % Parzen, sigma=0.2
figure; scatterd(z); axis([-2 10 0 0.3]);
plotm(w,1); hold on; plot(x,y,':');


In the $N$-dimensional case, various interpolation functions are useful.
A popular one is the Gaussian function:


$h(\rho(\mathbf{z}, \mathbf{z}_j)) = \frac{1}{\sigma_h^N \sqrt{(2\pi)^N |\mathbf{C}|}} \exp\left( -\frac{\rho^2(\mathbf{z}, \mathbf{z}_j)}{2\sigma_h^2} \right)$ with $\rho(\mathbf{z}, \mathbf{z}_j) = \sqrt{(\mathbf{z} - \mathbf{z}_j)^T \mathbf{C}^{-1} (\mathbf{z} - \mathbf{z}_j)}$ (5.26)


The matrix $\mathbf{C}$ must be symmetric and positive definite (Appendix B.5).
The metric is according to the Mahalanobis distance. The constant $\sigma_h$
controls the size of the influence zone of $h(\cdot)$. It can be chosen smaller as
the number of samples in the training set increases. If the training set is
very large, the actual choice of $\mathbf{C}$ is less important. If the set is not very
large, a suitable choice is the covariance matrix $\hat{\mathbf{C}}_k$ determined according
to (5.14).

The following algorithm implements a classification based on Parzen
estimation:

Algorithm 5.1 Parzen classification
Input: a labelled training set $T_S$, an unlabelled test set $T$.


1. Determination of $\sigma_h$: maximize the log-likelihood of the training set
$T_S$ by varying $\sigma_h$ using leave-one-out estimation (see Section 5.4).
In other words, select $\sigma_h$ such that

$\sum_{k=1}^{K} \sum_{j=1}^{N_k} \ln \hat{p}(\mathbf{z}_{k,j}|\omega_k)$

is maximized. Here, $\mathbf{z}_{k,j}$ is the $j$-th sample from the $k$-th class, which is
left out during the estimation of $\hat{p}(\mathbf{z}_{k,j}|\omega_k)$.

2. Density estimation: compute for each sample $\mathbf{z}$ in the test set the
density for each class:

$\hat{p}(\mathbf{z}|\omega_k) = \frac{1}{N_k} \sum_{\mathbf{z}_j \in T_k} \frac{1}{\sigma_h^N \sqrt{(2\pi)^N}} \exp\left( -\frac{\|\mathbf{z} - \mathbf{z}_j\|^2}{2\sigma_h^2} \right)$

3. Classification: assign the samples in $T$ to the class with the maximal
posterior probability:

$\hat{\omega} = \omega_k$ with $k = \arg\max_{i=1,\ldots,K} \left\{ \hat{p}(\mathbf{z}|\omega_i) \hat{P}(\omega_i) \right\}$

Output: the labels $\hat{\omega}$ of $T$.
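The three steps of Algorithm 5.1 can be sketched in plain Python as follows. This is an illustration under assumptions, not the PRTools implementation: the training set is passed as a dictionary that maps each class label to its list of measurement vectors, the kernel width is selected from a hypothetical list of candidate values rather than by continuous optimization, and the prior probabilities are estimated as the class frequencies in the training set.

```python
import math


def kernel(z, z_j, sigma_h):
    """N-dimensional Gaussian kernel with Euclidean distance (step 2)."""
    dim = len(z)
    squared_distance = sum((a - b) ** 2 for a, b in zip(z, z_j))
    norm = sigma_h ** dim * math.sqrt((2.0 * math.pi) ** dim)
    return math.exp(-squared_distance / (2.0 * sigma_h ** 2)) / norm


def density(z, class_samples, sigma_h):
    """Parzen estimate of the class-conditional density at z."""
    return sum(kernel(z, z_j, sigma_h) for z_j in class_samples) / len(class_samples)


def loo_log_likelihood(classes, sigma_h):
    """Leave-one-out log-likelihood of the training set (step 1).
    Every class needs at least two samples."""
    total = 0.0
    for samples in classes.values():
        for j, z in enumerate(samples):
            rest = samples[:j] + samples[j + 1:]
            # The floor avoids log(0) when the kernels underflow.
            total += math.log(max(density(z, rest, sigma_h), 1e-300))
    return total


def select_sigma(classes, candidates):
    """Pick the candidate width with the largest leave-one-out log-likelihood."""
    return max(candidates, key=lambda s: loo_log_likelihood(classes, s))


def classify(z, classes, sigma_h):
    """Assign z to the class with maximal density times estimated prior (step 3)."""
    n_total = sum(len(samples) for samples in classes.values())
    return max(
        classes,
        key=lambda k: density(z, classes[k], sigma_h) * len(classes[k]) / n_total,
    )
```

A typical use is to call `select_sigma` once on the training set and then `classify` for every vector in the test set. The grid of candidate widths is a simplification; any one-dimensional optimizer could take its place.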


Example 5.2 Classification of mechanical parts, Parzen estimation
We return to Example 2.2 in Chapter 2, where mechanical parts like
nuts and bolts, etc. must be classified in order to sort them.
Application of Algorithm 5.1 with Gaussians as the kernels and estimated
covariance matrices as the weighting matrices yields $\sigma_h = 0.0485$ as
the optimal sigma. Figure 5.4(a) presents the estimated overall
density. The corresponding decision boundaries are shown in Figure
5.4(b). To show that the choice of $\sigma_h$ significantly influences the
decision boundaries, in Figure 5.4(c) a similar density plot is shown,
for which $\sigma_h$ was set to 0.0175. The density estimate is more peaked,
and the decision boundaries (Figure 5.4(d)) are less smooth.

Figures 5.4(a-d) were generated using MATLAB code similar to that
given in Listing 5.3.

Figure 5.4 Probability densities of the measurements shown in Figure 5.1. (a) The
3D plot of the Parzen estimate of the unconditional density together with a 2D
contour plot of this density on the ground plane. The parameter $\sigma_h$ was set to
0.0485. (b) The resulting decision boundaries. (c) Same as (a), but with $\sigma_h$ set to
0.0175. (d) Same as (b), for the density estimate shown in (c)


5.3.2 Nearest neighbour classification 


In Parzen estimation, each sample in the training set contributes in a like
manner to the estimate. The estimation process is space-invariant.
Consequently, the trade-off which exists between resolution and
variance is a global one. A refinement would be to have an estimator with
high resolution in regions where the training set is dense, and with low
resolution in other regions. The advantage is that the balance between
resolution and variance can be adjusted locally.

Nearest neighbour estimation is a method that implements such a
refinement. The method is based on the following observation. Let
$R(\mathbf{z}) \subset \mathbb{R}^N$ be a hypersphere with volume $V$. The centre of $R(\mathbf{z})$ is $\mathbf{z}$. If
the number of samples in the training set $T_k$ is $N_k$, then the probability of
having exactly $n$ samples within $R(\mathbf{z})$ has a binomial distribution with
expectation:


$\mathrm{E}[n] = N_k \int_{\mathbf{y} \in R(\mathbf{z})} p(\mathbf{y}|\omega_k)\, d\mathbf{y} \approx N_k V p(\mathbf{z}|\omega_k)$ (5.27)


Suppose that the radius of the sphere around $\mathbf{z}$ is selected such that this
sphere contains exactly $\kappa$ samples. It is obvious that this radius depends
on the position $\mathbf{z}$ in the measurement space. Therefore, the volume will
depend on $\mathbf{z}$. We have to write $V(\mathbf{z})$ instead of $V$. With that, an estimate
of the density is:


$\hat{p}(\mathbf{z}|\omega_k) = \frac{\kappa}{N_k V(\mathbf{z})}$ (5.28)


The expression shows that in regions where $p(\mathbf{z}|\omega_k)$ is large, the volume
is expected to be small. This is similar to having a small interpolation
zone. If, on the other hand, $p(\mathbf{z}|\omega_k)$ is small, the sphere needs to grow in
order to collect the required $\kappa$ samples.

The parameter $\kappa$ controls the balance between the bias and variance.
This is like the parameter $\sigma_h$ in Parzen estimation. The choice of $\kappa$ should
be such that:


$\kappa \to \infty$ as $N_k \to \infty$ in order to obtain a low variance
$\kappa / N_k \to 0$ as $N_k \to \infty$ in order to obtain a low bias (5.29)


A suitable choice is to make $\kappa$ proportional to $\sqrt{N_k}$.
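In a one-dimensional measurement space the hypersphere is an interval of length twice the radius, so the nearest neighbour density estimate takes only a few lines of Python. The sketch below is an illustration; the function name is an assumption, and the radius is taken as the distance to the $\kappa$-th nearest sample.

```python
def knn_density_1d(z, class_samples, kappa):
    """Nearest neighbour density estimate at z in one dimension:
    kappa / (N_k * V(z)), with V(z) the length of the smallest interval
    centred at z that contains kappa samples."""
    if not 1 <= kappa <= len(class_samples):
        raise ValueError("kappa must lie between 1 and the number of samples")
    distances = sorted(abs(z - z_j) for z_j in class_samples)
    radius = distances[kappa - 1]      # distance to the kappa-th neighbour
    if radius == 0.0:
        raise ValueError("zero volume: z coincides with kappa samples")
    volume = 2.0 * radius              # length of the interval around z
    return kappa / (len(class_samples) * volume)
```

For the samples 0, 1, 2, 3, 4 and the position 2.5, two neighbours lie within a radius of 0.5, so the estimate is 2/(5 x 1) = 0.4. For this evenly spaced toy set a larger number of neighbours gives a lower estimate at the same position, because the interval has to grow faster than the count.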

Nearest neighbour estimation is of practical interest because it paves
the way to a classification technique that directly uses the training set,
i.e. without explicitly estimating probability densities. The
development of this technique is as follows. We consider the entire training
set and use the representation $T_S$ as in (5.1). The total number of
samples is $N_S$. Estimates of the prior probabilities follow from (5.18):
$\hat{P}(\omega_k) = N_k / N_S$.

As before, let $R(\mathbf{z}) \subset \mathbb{R}^N$ be a hypersphere with volume $V(\mathbf{z})$. In order
to classify a vector $\mathbf{z}$ we select the radius of the sphere around $\mathbf{z}$ such that
this sphere contains exactly $\kappa$ samples taken from $T_S$. These samples are
called the $\kappa$-nearest neighbours¹ of $\mathbf{z}$. Let $\kappa_k$ denote the number of
samples found with class $\omega_k$. An estimate of the conditional density is
(see (5.28)):


$\hat{p}(\mathbf{z}|\omega_k) \approx \frac{\kappa_k}{N_k V(\mathbf{z})}$ (5.30)


Combination of (5.18) and (5.30) in the Bayes classification with
uniform cost function (2.12) produces the following suboptimal
classification:


$\hat{\omega}(\mathbf{z}) = \omega_k$ with $k = \arg\max_{i=1,\ldots,K} \left\{ \hat{p}(\mathbf{z}|\omega_i) \hat{P}(\omega_i) \right\}$
$\quad = \arg\max_{i=1,\ldots,K} \left\{ \frac{\kappa_i}{N_i V(\mathbf{z})} \frac{N_i}{N_S} \right\}$
$\quad = \arg\max_{i=1,\ldots,K} \{ \kappa_i \}$ (5.31)


The interpretation of this classification is simple. The class assigned to a
vector $\mathbf{z}$ is the class with the maximum number of votes coming from $\kappa$
samples nearest to $\mathbf{z}$. In literature, this classification method is known as
k-nearest neighbour rule classification (k-NNR, but in our nomenclature
$\kappa$-NNR). The special case in which $\kappa = 1$ is simply referred to as nearest
neighbour rule classification (NNR or 1-NNR).
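The voting rule is simple enough to be sketched in a few lines of Python. The fragment below is only an illustration, not the PRTools routine: the function name, the use of the squared Euclidean distance, and the representation of the training set as a list of (vector, label) pairs are assumptions made here. Ties between classes are broken in favour of the class encountered first among the sorted neighbours.

```python
from collections import Counter


def knn_classify(z, training, kappa):
    """Classify z by majority vote among its kappa nearest training samples.

    training is a list of (measurement vector, class label) pairs.
    """
    def squared_distance(sample):
        vector, _ = sample
        return sum((a - b) ** 2 for a, b in zip(z, vector))

    neighbours = sorted(training, key=squared_distance)[:kappa]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

With two 'nut' samples near the origin and three 'bolt' samples near (1, 1), a vector close to the origin is classified as 'nut' for one or three neighbours. With five neighbours the whole training set votes and the larger class wins, which illustrates why the number of neighbours must remain small compared with the size of the training set.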


Example 5.3 Classification of mechanical parts, NNR classification
PRTools can be used to perform $\kappa$-nearest neighbour classification.
Listing 5.4 shows how a $\kappa$-nearest neighbour classifier is trained on
the mechanical parts data set of Example 2.2. If $\kappa$ is not specified, it
will be found by minimizing the leave-one-out error. For this data set,
the optimal $\kappa$ is 7. The resulting decision boundaries are shown in
Figure 5.5(a). If we set $\kappa$ to 1, the classifier will classify all samples in
the training set correctly (see Figure 5.5(b)), but its performance on a
test set will be worse.





¹ The literature about nearest neighbour classification often uses the symbol k to denote the
number of samples in a volume. However, in order to avoid confusion with symbols like $\omega_k$, $T_k$,
etc. we prefer to use $\kappa$.

Figure 5.5 Application of $\kappa$-NNR classification. (a) $\kappa = 7$. (b) $\kappa = 1$


Listing 5.4 
PRTools code for finding and plotting an optimal κ-nearest neighbour
classifier and a one-nearest neighbour classifier.


load nutsbolts;               % Load the dataset
[w,k] = knnc(z);              % Train a k-NNR
disp(k);                      % Show the optimal k found
figure; scatterd(z)           % Plot the dataset
plotc(w);                     % Plot the decision boundaries
w = knnc(z,1);                % Train a 1-NNR
figure; scatterd(z);          % Plot the dataset
plotc(w);                     % Plot the decision boundaries



The analysis of the performance of κ-nearest neighbour classification is
difficult. This holds true especially if the number of samples in the
training set is finite. In the limiting case, when the number of samples
grows to infinity, some bounds on the error rate can be given. Let the
minimum error rate, i.e. the error rate of a Bayes classifier with uniform
cost function, be denoted by E_min (see Section 2.1.1). Since E_min is the
minimum error rate among all classifiers, the error rate of a κ-NNR,
denoted E_κ, is bounded by:


    E_{\min} \le E_\kappa        (5.32)


It can be shown that for the 1-NNR the following upper bound holds:


    E_1 \le E_{\min} \left( 2 - \frac{K}{K-1} E_{\min} \right) \le 2 E_{\min}        (5.33)


Apparently, replacing the true probability densities with estimations 
based on the first nearest neighbour gives an error rate that is at most 
twice the minimum. Thus, at least half of the classification information 
in a dense training set is contained in the first nearest neighbour. 

In the two-class problem (K = 2) the following bound can be proven:


    E_\kappa \le E_{\min} + \frac{E_1}{\sqrt{0.5(\kappa - 1)\pi}}    if κ is odd
                                                                        (5.34)
    E_\kappa = E_{\kappa - 1}                                        if κ is even





The importance of (5.34) is that it shows that the performance of the
κ-NNR approximates the optimum as κ increases. This asymptotic
optimality holds true only if the training set is dense (Ns → ∞).
Nevertheless, even in the small sized training set given in Figure 5.5 it can be
seen that the 7-NNR is superior to the 1-NNR. The topology of the
compartments in Figure 5.5(b) is very specific for the given training set.
This is in contrast with the topology of the 7-NNR in Figure 5.5(a). The
7-NNR generalizes better than the 1-NNR.

Unfortunately, κ-NNR classifiers also have some serious disadvantages:


• A distance measure is needed in order to decide which sample in the
  training set is nearest. Usually the Euclidean distance measure is
  chosen, but this choice need not be optimal. A systematic
  method to determine the optimal measure is hard to find.

• The optimality is reached only when κ → ∞. But since at the same
  time it is required that κ/Ns → 0, the demand on the size of the
  training set is very high. If the size is not large enough, κ-NNR
  classification may be far from optimal.

• If the training set is large, the computational complexity of κ-NNR
  classification becomes a serious burden.


Many attempts have been made to remedy the last drawback. One method 
is to design fast algorithms with suitably chosen data structures (e.g. 
hierarchical data structures combined with a pre-ordered training set). 
Another method is to preprocess the training set so as to speed up the 
search for nearest neighbours. There are two principles on which this 
reduction can be based: editing and condensing. The first principle is to
edit the training set such that the (expensive) κ-NNR can be replaced
with the 1-NNR. The strategy is to remove those samples from the
training set that when used in the 1-NNR would cause erroneous results.
An algorithm that accomplishes this is the so-called multi-edit algorithm.
The following algorithm is from Devijver and Kittler.


Algorithm 5.2 Multi-edit 
Input: a labelled training set Ts. 


1. Diffusion: partition the training set Ts randomly into L disjoint
   subsets: T_1, T_2, ..., T_L with Ts = ∪_l T_l, and L ≥ 3.

2. Classification: classify the samples in T_l using 1-NNR classification
   with T_{(l+1) mod L} as training set.

3. Editing: discard all the samples that were misclassified at step 2.

4. Confusion: pool all the remaining samples to constitute a new
   training set Ts.

5. Termination: if the last I iterations produced no editing, then exit
   with the final training set, else go to step 1.


Output: a subset of Ts.


The subsets created in step 1 are regarded as independent random 
selections. A minimum of three subsets is required in order to avoid a 
two-way interaction between two subsets. Because in the first step the 
subsets are randomized, it cannot be guaranteed that, if during one 
iteration no changes in the training set occurred, changes in further 
iterations are ruled out. Therefore, the algorithm does not stop immedi- 
ately after an iteration with no changes has occurred. 

The effect of the algorithm is that ambiguous samples in the training
set are removed. This eliminates the need to use the κ-NNR. The 1-NNR
can be used instead.
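
The steps of the multi-edit algorithm can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the PRTools implementation: the names nn_label and multi_edit, the parameter patience (the number I of quiet iterations), the fixed random seed and the toy data are all hypothetical, and the subsets are indexed from 0 so that the modulo operation selects a valid reference subset.

```python
import math
import random

def nn_label(train, z):
    """1-NNR: label of the training sample nearest to z."""
    return min(train, key=lambda s: math.dist(s[0], z))[1]

def multi_edit(samples, n_subsets=3, patience=3, seed=0):
    """Sketch of the multi-edit algorithm; samples is a list of (vector, label)."""
    rng = random.Random(seed)
    current = list(samples)
    quiet = 0                                  # consecutive iterations without editing
    while quiet < patience and len(current) >= n_subsets:
        rng.shuffle(current)                   # step 1: diffusion
        subsets = [current[i::n_subsets] for i in range(n_subsets)]
        kept = []
        for l, subset in enumerate(subsets):   # step 2: classification
            reference = subsets[(l + 1) % n_subsets]
            # step 3: editing, keep only the correctly classified samples
            kept += [s for s in subset if nn_label(reference, s[0]) == s[1]]
        quiet = quiet + 1 if len(kept) == len(current) else 0
        current = kept                         # step 4: confusion
    return current                             # step 5: termination

# Hypothetical data: a dense 'a' cluster, three distant 'b' samples and
# one ambiguous 'b' sample located in the middle of the 'a' cluster.
cluster_a = [((0.1 * x, 0.1 * y), "a") for x in range(4) for y in range(3)]
far_b = [((10.0, 10.0), "b"), ((10.1, 10.0), "b"), ((10.0, 10.1), "b")]
outlier = ((0.15, 0.1), "b")
samples = cluster_a + far_b + [outlier]
edited = multi_edit(samples)
```

In this toy set every subset contains at least one 'a' sample, so the ambiguous sample is misclassified, and therefore discarded, in the first iteration. Note that correctly labelled samples may be discarded as well; the algorithm only guarantees that the result is a subset of the input.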

The second principle, called condensing, aims to remove samples that 
do not affect the classification in any way. This is helpful to reduce the 
computational cost. The algorithm — also from Devijver and Kittler — is 
used to eliminate all samples in the training set that are irrelevant. 


Algorithm 5.3 Condensing 
Input: a labelled training set Ts.


1. Initiation: set up two new training sets T_STORE and T_GRABBAG; place
   the first sample of Ts in T_STORE, all other samples in T_GRABBAG.

2. Condensing: use 1-NNR classification with the current T_STORE to
   classify a sample in T_GRABBAG; if classified correctly, the sample is
   retained in T_GRABBAG, otherwise it is moved from T_GRABBAG to
   T_STORE; repeat this operation for all other samples in T_GRABBAG.

3. Termination: if one complete pass is made through step 2 with no
   transfer from T_GRABBAG to T_STORE, or if T_GRABBAG is empty, then
   terminate; else go to step 2.


Output: a subset of Ts.


The effect of this algorithm is that in regions where the training set is 
overcrowded with samples of the same class most of these samples will 
be removed. The remaining set will, hopefully, contain samples close to 
the Bayes decision boundaries. 
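
A minimal plain-Python sketch of the condensing algorithm is given below. The function names and the one-dimensional toy data are assumptions made for the illustration; the sketch returns the set T_STORE.

```python
import math

def nn_label(train, z):
    """1-NNR: label of the training sample nearest to z."""
    return min(train, key=lambda s: math.dist(s[0], z))[1]

def condense(samples):
    """Sketch of the condensing algorithm; samples is a list of (vector, label)."""
    store = [samples[0]]                 # step 1: initiation
    grabbag = list(samples[1:])
    moved = True
    while moved and grabbag:             # step 3: termination test
        moved = False
        remaining = []
        for s in grabbag:                # step 2: one pass through T_GRABBAG
            if nn_label(store, s[0]) == s[1]:
                remaining.append(s)      # classified correctly: stays in T_GRABBAG
            else:
                store.append(s)          # misclassified: moved to T_STORE
                moved = True
        grabbag = remaining
    return store

# Hypothetical one-dimensional data: two well separated, crowded classes.
samples = [((0.0,), "a"), ((1.0,), "a"), ((2.0,), "a"), ((3.0,), "a"),
           ((10.0,), "b"), ((11.0,), "b"), ((12.0,), "b"), ((13.0,), "b")]
print(condense(samples))   # [((0.0,), 'a'), ((10.0,), 'b')]
```

In this example one sample per class suffices to classify all other samples correctly with the 1-NNR, so the eight samples are condensed to two.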


Example 5.4 Classification of mechanical parts, editing and 
condensation 
An example of a multi-edited training set is given in Figure 5.6(a). The 
decision boundaries of the 1-NNR classifier are also shown. It can be 
seen that the topology of the resulting decision function is in accord- 
ance with the one of the 7-NNR given in Figure 5.5(a). Hence, multi- 
editing improves the generalization property. 

Figure 5.6(b) shows that condensing can be successful when applied
to a multi-edited training set. The decision boundaries in Figure 5.6(b)
are close to those in Figure 5.6(a), especially in the more important
areas of the measurement space.

Figure 5.6 Application of editing and condensing. (a) Edited training set. (b) Edited
and condensed training set

The basic PRTools code used to generate Figure 5.6 is given in
Listing 5.5.


Listing 5.5 

PRTools code for finding and plotting one-nearest neighbour classifiers 
on both an edited and a condensed data set. The function edicon takes 
a distance matrix as input. In PRTools, calculating a distance matrix is 
implemented as a mapping proxm, so z * proxm(z) is the distance 
matrix between all samples in z. See Section 7.2. 


load nutsbolts;                       % Load the dataset z
J = edicon(z*proxm(z),3,5,[]);        % Edit z
w = knnc(z(J,:),1);                   % Train a 1-NNR
figure; scatterd(z(J,:)); plotc(w);
J = edicon(z*proxm(z),3,5,10);        % Edit and condense z
w = knnc(z(J,:),1);                   % Train a 1-NNR
figure; scatterd(z(J,:)); plotc(w);


If a non-edited training set is fed into the condensing algorithm, it may 
result in erroneous decision boundaries, especially in areas of the meas- 
urement space where the training set is ambiguous. 


5.3.3 Linear discriminant functions 


Discriminant functions are functions g_k(z), k = 1,...,K that are used
in a decision function as follows:


    \hat{\omega}(\mathbf{z}) = \omega_n  with:  n = \arg\max_{k=1,\ldots,K} \{ g_k(\mathbf{z}) \}        (5.35)


Clearly, if g_k(z) are the posterior probabilities P(ω_k|z), the decision
function becomes a Bayes decision function with a uniform cost function.
Since the posterior probabilities are not known, the strategy is to
replace the probabilities with some predefined functions g_k(z) whose
parameters should be learned from a labelled training set.

An assumption often made is that the samples in the training set can
be classified correctly with linear decision boundaries. In that case, the
discriminant functions take the form of:


    g_k(\mathbf{z}) = \mathbf{w}_k^T \mathbf{z} + w_k        (5.36)


Functions of this type are called linear discriminant functions. In fact,
these functions implement a linear machine. See also Section 2.1.2.

The notation can be simplified by the introduction of an augmented
measurement vector y, defined as:


    \mathbf{y} = \begin{bmatrix} \mathbf{z} \\ 1 \end{bmatrix}        (5.37)


With that, the discriminant functions become:


    g_k(\mathbf{y}) = \mathbf{w}_k^T \mathbf{y}        (5.38)


where the scalar w_k in (5.36) has been embedded in the vector \mathbf{w}_k by
augmenting the latter with the extra element w_k.

The augmentation can also be used for a generalization that allows for
nonlinear machines. For instance, a quadratic machine is obtained with:


    \mathbf{y}(\mathbf{z}) = [\, \mathbf{z}^T \;\; 1 \;\; z_0^2 \;\; z_1^2 \;\cdots\; z_{N-1}^2 \;\; z_0 z_1 \;\; z_0 z_2 \;\cdots\; z_{N-2} z_{N-1} \,]^T        (5.39)

The corresponding functions g_k(y) = w_k^T y(z) are called generalized linear
discriminant functions.
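
As an illustration, the augmented vector, the quadratic expansion and the decision function (5.35) can be sketched in plain Python. The function names and the hand-picked weight vectors are hypothetical; classes are identified by their index 0,...,K−1.

```python
from itertools import combinations

def augment(z):
    """Augmented measurement vector y = [z, 1]."""
    return list(z) + [1.0]

def quadratic_features(z):
    """y(z) of a quadratic machine: z, 1, all squares, all cross products."""
    squares = [zi * zi for zi in z]
    cross = [zi * zj for zi, zj in combinations(z, 2)]
    return list(z) + [1.0] + squares + cross

def classify(weights, y):
    """Index k of the largest discriminant function g_k(y) = w_k^T y."""
    scores = [sum(wi * yi for wi, yi in zip(w, y)) for w in weights]
    return max(range(len(scores)), key=scores.__getitem__)

print(quadratic_features([2.0, 3.0]))   # [2.0, 3.0, 1.0, 4.0, 9.0, 6.0]

# Hand-picked weights: g_0 = 1 - z0^2 - z1^2 and g_1 = 0, so class 0 is
# assigned inside the unit circle and class 1 outside it.
weights = [[0.0, 0.0, 1.0, -1.0, -1.0, 0.0],
           [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
print(classify(weights, quadratic_features([0.5, 0.5])))   # 0
print(classify(weights, quadratic_features([2.0, 0.0])))   # 1
```

The decision boundary is a circle in the measurement space, although the machine is still linear in the expanded vector y(z).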

Discriminant functions depend on a set of parameters. In (5.38) these
parameters are the vectors w_k. In essence, the learning process boils down
to a search for parameters such that with these parameters the decision
function in (5.35) correctly classifies all samples in the training set.

The basic approach to find the parameters is to define a performance
measure that depends on both the training set and the set of parameters.
Adjustment of the parameters such that the performance measure is
maximized gives the optimal decision function; see Figure 5.7.

Figure 5.7 Training by means of performance optimization

Strategies to adjust the parameters may be further categorized into
‘iterative’ and ‘non-iterative’. Non-iterative schemes are found when the
performance measure allows for an analytic solution of the optimization. For
instance, suppose that the set of parameters is denoted by w and that the
performance measure is a continuous function J(w) of w. The optimal
solution is one which maximizes J(w). Hence, the solution must satisfy
∂J(w)/∂w = 0.

In iterative strategies the procedure to find a solution is numerical. 
Samples from the training set are fed into the decision function. The 
classes found are compared with the true classes. The result controls the 
adjustment of the parameters. The adjustment is in a direction which 
improves the performance. By repeating this procedure it is hoped that 
the parameters iterate towards the optimal solution. 

The most popular search strategy is the gradient ascent method (also
called steepest ascent)². Suppose that the performance measure J(w) is a
continuous function of the parameters contained in w. Furthermore,
suppose that ∇J(w) = ∂J(w)/∂w is the gradient vector. Then the gradient
ascent method updates the parameters according to:


    \mathbf{w}(i + 1) = \mathbf{w}(i) + \eta(i) \nabla J(\mathbf{w}(i))        (5.40)


where w(i) is the parameter obtained in the i-th iteration. η(i) is the
so-called learning rate. If η(i) is selected too small, the process converges
very slowly, but if it is too large, the process may overshoot the maximum,
or oscillate near the maximum. Hence, a compromise must be found.
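
The influence of the learning rate can be demonstrated with a small plain-Python sketch. The performance measure used here, J(w) = −(w_0 − 1)² − (w_1 + 2)², is a hypothetical toy example with its maximum at w = (1, −2), and the learning rate is kept constant for simplicity.

```python
def gradient_ascent(grad, w0, eta, n_iter):
    """Iterate w(i+1) = w(i) + eta * grad J(w(i)) with a constant learning rate."""
    w = list(w0)
    for _ in range(n_iter):
        g = grad(w)
        w = [wi + eta * gi for wi, gi in zip(w, g)]
    return w

def grad_J(w):
    """Gradient of the toy measure J(w) = -(w0 - 1)^2 - (w1 + 2)^2."""
    return [-2.0 * (w[0] - 1.0), -2.0 * (w[1] + 2.0)]

w_small = gradient_ascent(grad_J, [0.0, 0.0], eta=0.1, n_iter=100)
w_large = gradient_ascent(grad_J, [0.0, 0.0], eta=1.1, n_iter=100)
print(w_small)   # close to [1, -2]
print(w_large)   # far away from the maximum: the iteration has diverged
```

For this measure each iteration multiplies the distance to the maximum by |1 − 2η|. With η = 0.1 the factor is 0.8 and the process converges; with η = 1.1 the factor is 1.2, so every step overshoots the maximum by more than the previous one.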

Different choices of the performance measures and different search 
strategies lead to a multitude of different learning methods. This section 
confines itself to two-class problems. From the many iterative, gradient- 
based methods we only discuss ‘perceptron learning’ and the ‘least 
squared error learning’. Perhaps the practical significance of these two 
methods is not very large, but they are introductory to the more involved 
techniques of succeeding sections. 


Perceptron learning 


In a two-class problem, the decision function expressed in (5.35) is
equivalent to a test g_1(y) − g_2(y) > 0. If the test fails, it is decided for
ω_2, otherwise for ω_1. The test can be accomplished equally well with a
single linear function:


    g(\mathbf{y}) = \mathbf{w}^T \mathbf{y}        (5.41)


defined as g(y) = g_1(y) − g_2(y). The so-called perceptron, graphically
represented in Figure 5.8, is a computational structure that implements
g(y). The two possible classes are encoded in the output as ‘1’ and ‘−1’.

² Equivalently, we define J(w) as an error measure. A gradient descent method should be
applied to minimize it.

A simple performance measure of a classifier is obtained by applying the
training set to the classifier and counting the samples that are erroneously
classified. Obviously, such a performance measure (actually an error
measure) should be minimized. The disadvantage of this measure is that it is
not a continuous function of w. Therefore, the gradient is not well defined.

The performance measure of the perceptron is based on the following
observation. Suppose that a sample y_n is misclassified. Thus, if the true
class of the sample is ω_1, then g(y_n) = w^T y_n is negative, and if the true
class is ω_2, then g(y_n) = w^T y_n is positive. In the former case we would
like to correct w^T y_n with a positive constant, in the latter case with a
negative constant. We define Y_1(w) as the set containing all ω_1 samples
in the training set that are misclassified, and Y_2(w) as the set of all
misclassified ω_2 samples. Then:


    J_{perceptron}(\mathbf{w}) = - \sum_{\mathbf{y} \in Y_1} \mathbf{w}^T \mathbf{y} + \sum_{\mathbf{y} \in Y_2} \mathbf{w}^T \mathbf{y}        (5.42)


This measure is continuous in w and its gradient is:


    \nabla J_{perceptron}(\mathbf{w}) = - \sum_{\mathbf{y} \in Y_1} \mathbf{y} + \sum_{\mathbf{y} \in Y_2} \mathbf{y}        (5.43)


Application of the gradient descent, see (5.40), gives the following
learning rule:


    \mathbf{w}(i + 1) = \mathbf{w}(i) - \eta \left( - \sum_{\mathbf{y} \in Y_1} \mathbf{y} + \sum_{\mathbf{y} \in Y_2} \mathbf{y} \right)        (5.44)


where i is the iteration count. The iteration procedure stops when
w(i + 1) = w(i), i.e. when all samples in the training set are classified
correctly. If such a solution exists, that is if the training set is linearly
separable, the perceptron learning rule will find it.

Figure 5.8 The perceptron

Instead of processing the full training set in one update step (so-called 
batch processing) we can also cycle through the training set and update 
the weight vector whenever a misclassified sample has been encountered 
(single sample processing). If y_n is a misclassified sample, then the
learning rule becomes:


    \mathbf{w}(i + 1) = \mathbf{w}(i) + \eta c_n \mathbf{y}_n        (5.45)


The variable c_n is +1 if y_n is a misclassified ω_1 sample. If y_n is a
misclassified ω_2 sample, then c_n = −1.

The error correction procedure of (5.45) can be extended to the
multi-class problem as follows. Let g_k(y) = w_k^T y as before. We cycle through
the training set and update the weight vectors w_k and w_j whenever a ω_k
sample is classified as class ω_j:


    \mathbf{w}_k \rightarrow \mathbf{w}_k + \eta \mathbf{y}_n
    \mathbf{w}_j \rightarrow \mathbf{w}_j - \eta \mathbf{y}_n        (5.46)


The procedure will converge in a finite number of iterations provided
that the training set is linearly separable. Perceptron training is
illustrated in Example 5.5.
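
Independently of PRTools, the single sample learning rule (5.45) for the two-class case can be sketched in plain Python. The function name, the zero initial weight vector, the stopping criterion (one complete cycle without corrections) and the toy data are assumptions of this sketch. A sample lying exactly on the decision boundary is treated as misclassified.

```python
def train_perceptron(samples, eta=1.0, max_epochs=100):
    """Single sample perceptron learning.

    samples: list of (z, c) with c = +1 for class omega_1, c = -1 for omega_2.
    Returns the augmented weight vector and a flag telling whether a
    complete cycle without corrections was reached.
    """
    w = [0.0] * (len(samples[0][0]) + 1)
    for _ in range(max_epochs):
        corrections = 0
        for z, c in samples:
            y = list(z) + [1.0]                       # augmented vector
            g = sum(wi * yi for wi, yi in zip(w, y))
            if c * g <= 0:                            # misclassified sample
                w = [wi + eta * c * yi for wi, yi in zip(w, y)]
                corrections += 1
        if corrections == 0:                          # all samples correct
            return w, True
    return w, False

# Hypothetical, linearly separable training set (logical AND).
samples = [((0.0, 0.0), -1), ((0.0, 1.0), -1), ((1.0, 0.0), -1), ((1.0, 1.0), +1)]
w, converged = train_perceptron(samples)
```

Since this training set is linearly separable, the procedure stops after a finite number of corrections, and the returned w classifies all four samples correctly.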


Least squared error learning 


A disadvantage of the perceptron learning rule is that it only works well 
in separable cases. If the training set is not separable, the iterative 
procedure often tends to fluctuate around some value. The procedure 
is terminated at some arbitrary point, but it becomes questionable then 
whether the corresponding solution is useful. 

Non-separable cases can be handled if we change the performance
measure such that its optimization boils down to solving a set of linear
equations. Such a situation is created if we introduce a set of so-called
target vectors. A target vector t_n is a K-dimensional vector associated
with the (augmented) sample y_n. Its value reflects the desired response of
the discriminant function to y_n. The simplest one is place coding:


    t_{n,k} = \begin{cases} 1 & \text{if } \theta_n = \omega_k \\ 0 & \text{otherwise} \end{cases}        (5.47)


(We recall that θ_n is the class label of sample y_n.) This target function
aims at a classification with minimum error rate.
We now can apply a least squares criterion to find the weight vectors:


    J_{LS} = \sum_{n=1}^{N_S} \sum_{k=1}^{K} \left( \mathbf{w}_k^T \mathbf{y}_n - t_{n,k} \right)^2        (5.48)


The values of w_k that minimize J_LS are the weight vectors of the least
squared error criterion.

The solution can be found by rephrasing the problem in a different
notation. Let Y = [y_1 ... y_Ns]^T be a Ns × (N + 1) matrix, W = [w_1 ... w_K] a
(N + 1) × K matrix, and T = [t_1 ... t_Ns]^T a Ns × K matrix. Then:


    J_{LS} = \| \mathbf{Y}\mathbf{W} - \mathbf{T} \|^2        (5.49)


where ||·||² is the Euclidean matrix norm, i.e. the sum of squared
elements. The value of W that minimizes J_LS is the LS solution to the
problem:


    \mathbf{W}_{LS} = (\mathbf{Y}^T \mathbf{Y})^{-1} \mathbf{Y}^T \mathbf{T}        (5.50)


Of course, the solution is only valid if (Y^T Y)^{-1} exists. The matrix
(Y^T Y)^{-1} Y^T is the pseudo inverse of Y. See (3.25).
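
To make the LS solution concrete, the following plain-Python sketch solves the normal equations (Y^T Y) W = Y^T T for place-coded targets. The helper solve (Gauss-Jordan elimination with partial pivoting), the function names and the one-dimensional toy data are assumptions of this sketch; it presumes that Y^T Y is invertible and identifies the classes by their index 0,...,K−1.

```python
def solve(a, b):
    """Solve a x = b for x (a: n x n, b: n x m) by Gauss-Jordan elimination."""
    n = len(a)
    m = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]

def ls_weights(samples, n_classes):
    """Least squares weight matrix W for place-coded targets."""
    Y = [list(z) + [1.0] for z, _ in samples]              # augmented samples
    T = [[1.0 if k == label else 0.0 for k in range(n_classes)]
         for _, label in samples]                          # place coding
    d = len(Y[0])
    YtY = [[sum(y[i] * y[j] for y in Y) for j in range(d)] for i in range(d)]
    YtT = [[sum(y[i] * t[k] for y, t in zip(Y, T)) for k in range(n_classes)]
           for i in range(d)]
    return solve(YtY, YtT)                                 # (N+1) x K matrix

def classify(W, z):
    """Index of the largest discriminant function g_k(y) = w_k^T y."""
    y = list(z) + [1.0]
    g = [sum(W[i][k] * y[i] for i in range(len(y))) for k in range(len(W[0]))]
    return max(range(len(g)), key=g.__getitem__)

# Hypothetical one-dimensional data: class 0 at z = 0, 1 and class 1 at z = 3, 4.
samples = [((0.0,), 0), ((1.0,), 0), ((3.0,), 1), ((4.0,), 1)]
W = ls_weights(samples, 2)
```

For this data set the fitted discriminant functions are g_0(z) = −0.3z + 1.1 and g_1(z) = 0.3z − 0.1, so the decision boundary lies at z = 2.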
An interesting target function is:


    t_{n,k} = C(\omega_k|\theta_n)        (5.51)


Here, t_n embeds the cost that is involved if the assigned class is ω_k
whereas the true class is θ_n. This target function aims at a classification
with minimal risk and the discriminant function g_k(y) attempts to
approximate the risk Σ_i C(ω_k|ω_i)P(ω_i|y) by linear LS fitting. The
decision function in (5.35) should now involve a minimization rather
than a maximization.

Example 5.5 illustrates how the least squared error classifier can be 
found in PRTools. 


Example 5.5 Classification of mechanical parts, perceptron and
least squared error classifier

Decision boundaries for the mechanical parts example are shown in
Figure 5.9(a) (perceptron) and Figure 5.9(b) (least squared error
classifier). These plots were generated by the code shown in Listing 5.6.
In PRTools, the linear perceptron classifier is implemented as perlc;
the least squared error classifier is called fisherc. For perlc to find
a good perceptron, the learning rate η had to be set to 0.01. Training
was stopped after 1000 iterations. Interestingly, the least squared error
classifier is not able to separate the data successfully, because the
‘scrap’ class is not linearly separable from the other classes.

Figure 5.9 Application of two linear classifiers. (a) Linear perceptron. (b) Least
squared error classifier


Listing 5.6 
PRTools code for finding and plotting a linear perceptron and least 
squared error classifier on the mechanical parts data set. 


load nutsbolts;                     % Load the dataset
w = perlc(z,1000,0.01);             % Train a linear perceptron
figure; scatterd(z); plotc(w);
w = fisherc(z);                     % Train a LS error classifier
figure; scatterd(z); plotc(w);


5.3.4 The support vector classifier 


The basic support vector classifier is very similar to the perceptron. Both
are linear classifiers, assuming separable data. In perceptron learning,
the iterative procedure is stopped when all samples in the training set are
classified correctly. For linearly separable data, this means that the found
perceptron is one solution arbitrarily selected from an (in principle)
infinite set of solutions. In contrast, the support vector classifier chooses
one particular solution: the classifier which separates the classes with
maximal margin. The margin is defined as the width of the largest ‘tube’
not containing samples that can be drawn around the decision boundary;
see Figure 5.10. It can be proven that this particular solution has the
highest generalization ability.

Mathematically, this can be expressed as follows. Assume we have
training samples z_n, n = 1,...,Ns (not augmented with an extra element)
and for each sample a label c_n ∈ {1, −1}, indicating to which of the two
classes the sample belongs. Then a linear classifier g(z) = w^T z + b is
sought, such that:


    \mathbf{w}^T \mathbf{z}_n + b \ge 1     if c_n = +1
                                                            for all n        (5.52)
    \mathbf{w}^T \mathbf{z}_n + b \le -1    if c_n = -1


These two constraints can be rewritten into one inequality:


    c_n (\mathbf{w}^T \mathbf{z}_n + b) \ge 1        (5.53)


The gradient vector of g(z) is w. Therefore, the square of the margin is
inversely proportional to ||w||² = w^T w. To maximize the margin, we
have to minimize ||w||². Using Lagrange multipliers, we can incorporate
the constraints (5.53) into the minimization:


    L = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{n=1}^{N_S} \alpha_n \left( c_n [\mathbf{w}^T \mathbf{z}_n + b] - 1 \right),    \alpha_n \ge 0        (5.54)


Figure 5.10 The linear support vector classifier (decision boundary
w^T z + b = 0, margin boundaries w^T z + b = ±1, support vectors on the margin
boundaries)

L should be minimized with respect to w and b, and maximized with
respect to the Lagrange multipliers α_n. Setting the partial derivatives of L
w.r.t. w and b to zero results in the constraints:


    \mathbf{w} = \sum_{n=1}^{N_S} c_n \alpha_n \mathbf{z}_n
                                                        (5.55)
    \sum_{n=1}^{N_S} c_n \alpha_n = 0


Resubstituting this into (5.54) gives the so-called dual form:


    L = \sum_{n=1}^{N_S} \alpha_n - \frac{1}{2} \sum_{n=1}^{N_S} \sum_{m=1}^{N_S} c_n c_m \alpha_n \alpha_m \mathbf{z}_n^T \mathbf{z}_m,    \alpha_n \ge 0        (5.56)

L should be maximized with respect to the α_n. This is a quadratic
optimization problem, for which standard software packages are available.
After optimization, the α_n are used in (5.55) to find w. In typical
problems, the solution is sparse, meaning that many of the α_n become 0.
Samples z_n for which α_n = 0 are not required in the computation of w.
The remaining samples z_n (for which α_n > 0) are called support vectors.
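
How w follows from the multipliers can be shown with a tiny example that is solved by hand instead of by a quadratic optimizer. In the plain-Python sketch below, the function name, the three toy samples and their multipliers are hypothetical. The computation of b is an assumption that is not spelled out in the text above: a support vector lies exactly on the margin, so that c_n(w^T z_n + b) = 1 and hence b = c_n − w^T z_n.

```python
def primal_from_dual(samples, labels, alphas, tol=1e-12):
    """Compute w as the sum of c_n alpha_n z_n, b, and the support vector indices."""
    dim = len(samples[0])
    w = [sum(c * a * z[i] for z, c, a in zip(samples, labels, alphas))
         for i in range(dim)]
    support = [n for n, a in enumerate(alphas) if a > tol]
    # Assumption: a support vector lies on the margin, so b = c_n - w^T z_n.
    n0 = support[0]
    b = labels[n0] - sum(wi * zi for wi, zi in zip(w, samples[n0]))
    return w, b, support

# Toy problem: for these three samples the dual is maximized by
# alpha = (0.5, 0.5, 0), a result that was worked out by hand.
samples = [(1.0, 0.0), (-1.0, 0.0), (3.0, 0.0)]
labels = [+1, -1, +1]
alphas = [0.5, 0.5, 0.0]
w, b, support = primal_from_dual(samples, labels, alphas)
print(w, b, support)   # [1.0, 0.0] 0.0 [0, 1]
```

The third sample lies well outside the margin. Its multiplier is zero, and it has no influence on w.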

This formulation of the support vector classifier is of limited use: it
only covers a linear classifier for separable data. To construct nonlinear
boundaries, the generalized discriminant functions introduced in (5.39)
can be applied. The data is transformed from the measurement space to a
new feature space. This can be done efficiently in this case because in
formulation (5.56) all samples are coupled to other samples by an inner
product. For instance, when all polynomial terms up to degree 2 are used
(as in (5.39)), we can write:


    \mathbf{y}(\mathbf{z}_n)^T \mathbf{y}(\mathbf{z}_m) = (\mathbf{z}_n^T \mathbf{z}_m + 1)^2 = K(\mathbf{z}_n, \mathbf{z}_m)        (5.57)


This can be generalized further: instead of (z_n^T z_m + 1)² any integer degree
(z_n^T z_m + 1)^d with d > 1 can be used. Due to the fact that only the inner
products between the samples are considered, the very expensive explicit
expansion is avoided. The resulting decision boundary is a d-th degree
polynomial in the measurement space. The classifier w cannot easily
be expressed explicitly (as in (5.55)). However, we are only interested in the
classification result. And this is in terms of the inner product between the
object z to be classified and the classifier (compare also with (5.36)):


    g(\mathbf{z}) = \mathbf{w}^T \mathbf{y}(\mathbf{z}) = \sum_{n=1}^{N_S} c_n \alpha_n K(\mathbf{z}, \mathbf{z}_n)        (5.58)
Replacing the inner product by a more general kernel function is called
the kernel trick. Besides polynomial kernels, other kernels have been
proposed. The Gaussian kernel with σ²I as weighting matrix (the radial
basis function kernel, RBF kernel) is frequently used in practice:


    K(\mathbf{z}_n, \mathbf{z}_m) = \exp\left( - \frac{\| \mathbf{z}_n - \mathbf{z}_m \|^2}{\sigma^2} \right)        (5.59)


For very small values of σ, this kernel gives very detailed boundaries,
while for high values very smooth boundaries are obtained.
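
The kernel trick can be checked numerically with a short plain-Python sketch. All names are hypothetical. The explicit expansion below is written for two-dimensional vectors and scales some of its terms with √2; this scaling is an assumption needed to make the inner product of the expanded vectors exactly equal to (z_n^T z_m + 1)². The decision function weights each kernel value with c_n α_n, in line with (5.55).

```python
import math

def poly_kernel(zn, zm, d=2):
    """Polynomial kernel (zn^T zm + 1)^d."""
    return (sum(a * b for a, b in zip(zn, zm)) + 1.0) ** d

def rbf_kernel(zn, zm, sigma=1.0):
    """Gaussian (RBF) kernel exp(-||zn - zm||^2 / sigma^2)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(zn, zm)) / sigma ** 2)

def explicit_quadratic(z):
    """Explicit degree-2 expansion of a 2-D vector (with sqrt(2) scaling)."""
    r2 = math.sqrt(2.0)
    return [1.0, r2 * z[0], r2 * z[1], z[0] ** 2, z[1] ** 2, r2 * z[0] * z[1]]

def decision(z, samples, labels, alphas, kernel):
    """g(z) as the sum over n of c_n alpha_n K(z, z_n)."""
    return sum(c * a * kernel(z, zn) for zn, c, a in zip(samples, labels, alphas))

a, b = (1.0, 2.0), (3.0, -1.0)
via_kernel = poly_kernel(a, b)
via_expansion = sum(p * q for p, q in zip(explicit_quadratic(a), explicit_quadratic(b)))
print(via_kernel, via_expansion)   # both equal 4 (up to rounding)

# Hypothetical multipliers for two samples, one from each class.
samples, labels, alphas = [(1.0, 0.0), (-1.0, 0.0)], [+1, -1], [0.5, 0.5]
print(decision((2.0, 0.0), samples, labels, alphas, rbf_kernel) > 0)   # True
```

The kernel evaluation needs only one inner product in the measurement space, whereas the explicit route first builds two six-dimensional vectors. The difference grows rapidly with the dimension N and the degree d.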

In order to cope with overlapping classes, the support vector classifier
can be extended to have some samples erroneously classified. For that,
the hard constraints (5.53) are replaced by soft constraints:


    \mathbf{w}^T \mathbf{z}_n + b \ge 1 - \xi_n     if c_n = +1
                                                                    (5.60)
    \mathbf{w}^T \mathbf{z}_n + b \le -1 + \xi_n    if c_n = -1


Here so-called slack variables ξ_n ≥ 0 are introduced. These should be
minimized in combination with ||w||². The optimization problem is
thus changed into:


    L = \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{n=1}^{N_S} \xi_n
        - \sum_{n=1}^{N_S} \alpha_n \left( c_n [\mathbf{w}^T \mathbf{z}_n + b] - 1 + \xi_n \right)
        - \sum_{n=1}^{N_S} \gamma_n \xi_n,    \alpha_n, \gamma_n \ge 0        (5.61)


The second term expresses our desire to have the slack variables as small
as possible. C is a trade-off parameter that determines the balance
between having a large overall margin at the cost of more erroneously
classified samples, or having a small margin with fewer erroneously
classified samples. The last term holds the Lagrange multipliers that are
needed to assure that ξ_n ≥ 0.

The dual formulation of this problem is the same as (5.56). Its
derivation is left as an exercise for the reader; see exercise 6. The only
difference is that an extra upper bound on α_n is introduced: α_n ≤ C.
It basically means that the influence of a single object on the description
of the classifier is limited. This upper bound avoids that noisy objects
with a very large weight completely determine the weight vector and
thus the classifier. The parameter C has a large influence on the final
solution, in particular when the classification problem contains
overlapping class distributions. It should be set carefully. Unfortunately, it is
not clear beforehand what a suitable value for C will be. It depends on
both the data and the type of kernel function which is used. No generally
applicable number can be given. The only option in a practical
application is to run cross-validation (Section 5.4) to optimize C.

The support vector classifier has many advantages. A unique global 
optimum for its parameters can be found using standard optimization 
software. Nonlinear boundaries can be used without much extra com- 
putational effort. Moreover, its performance is very competitive with 
other methods. A drawback is that the problem complexity is not of the 
order of the dimension of the samples, but of the order of the number of 
samples. For large sample sizes (Ns > 1000) general quadratic program- 
ming software will often fail and special-purpose optimizers using 
problem-specific speedups have to be used to solve the optimization. 

A second drawback is that, like the perceptron, the classifier is
basically a two-class classifier. The simplest solution for obtaining a classifier
with more than two classes is to train K classifiers to distinguish one
class from the rest (similar to the place coding mentioned above). The
classifier with the highest output w^T z + b then determines the class label.
Although the solution is simple to implement, and works reasonably
well, it can lead to problems because the output value of the support
vector classifier is only determined by the margin between the classes it is
trained for, and is not optimized to be used for a confidence estimation.
Other methods train K classifiers simultaneously, incorporating the
one-class-against-the-rest labelling directly into the constraints of the
optimization. This gives again a quadratic optimization problem, but the
number of constraints increases significantly, which complicates the
optimization.


Example 5.6 Classification of mechanical parts, support vector
classifiers

Decision boundaries found by support vector classifiers for the
mechanical parts example are shown in Figure 5.11. These plots were
generated by the code shown in Listing 5.7. In Figure 5.11(a), the
kernel used was a polynomial one with degree d = 2 (a quadratic
kernel); in Figure 5.11(b), it was a Gaussian kernel with a width
σ = 0.1. In both cases, the trade-off parameter C was set to 100; if it
was set smaller, especially the support vector classifier with the
polynomial kernel did not find good results. Note in Figure 5.11(b) how
the decision boundary is built up by Gaussians around the support
vectors, and so forms a closed boundary around the classes.

Figure 5.11 Application of two support vector classifiers. (a) Polynomial kernel,
d = 2, C = 100. (b) Gaussian kernel, σ = 0.1, C = 100


Listing 5.7 
PRTools code for finding and plotting two different support vector
classifiers.


load nutsbolts;                     % Load the dataset
w = svc(z,'p',2,100);               % Train a quadratic kernel svc
figure; scatterd(z); plotc(w);
w = svc(z,'r',0.1,100);             % Train a Gaussian kernel svc
figure; scatterd(z); plotc(w);


5.3.5 The feed-forward neural network 


A neural network extends the perceptron in another way: it combines
the output of several perceptrons by another perceptron. A single
perceptron is called a neuron in neural network terminology. Like a
perceptron, a neuron computes the weighted sum of the inputs. However,
instead of a sign function, a more general transfer function is applied.
A transfer function used often is the sigmoid function, a continuous
version of the sign function:


    g(\mathbf{y}) = f(\mathbf{w}^T \mathbf{y}) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{y})}        (5.62)


where y is the vector z augmented with a constant value 1. The vector
w is called the weight vector, and the specific weight corresponding to
the constant value 1 in y is called the bias weight.

In principle, several layers of different numbers of neurons can be 
constructed. For an example, see Figure 5.12. Neurons which are not 
directly connected to the input or output are called hidden neurons. The 
hidden neurons are organized in (hidden) layers. If all neurons in the 
network compute their output based only on the output of neurons in 
previous layers, the network is called a feed-forward neural network. In 
a feed-forward neural network, no loops are allowed (neurons cannot 
get their input from next layers). 

Assume that we have only one hidden layer with H hidden neurons.
The output of the total neural network is:


    g_k(\mathbf{y}) = f\left( \sum_{h=1}^{H} v_{k,h} f(\mathbf{w}_h^T \mathbf{y}) + v_{k,H+1} \right)        (5.63)


Here, w_h is the weight vector of the inputs to hidden neuron h, and v_k is
the weight vector of the inputs to output neuron k. Analogous to least
squared error fitting, we define a sum of squared errors between the
output of the neural network and the target vector:


    J_{SE} = \frac{1}{2} \sum_{n=1}^{N_S} \sum_{k=1}^{K} \left( g_k(\mathbf{y}_n) - t_{n,k} \right)^2        (5.64)


Figure 5.12 A two-layer feed-forward neural network with two input dimensions
and one output (for presentation purposes, not all connections have been drawn)

The target vector is usually created by place coding: t_{n,k} = 1 if the label
of sample y_n is ω_k, otherwise it is 0. However, as the sigmoid function
lies in the open interval (0, 1), the values 0 and 1 are hard to reach, and as a
result the weights will grow very large. To prevent this, often targets are
chosen that are easier to reach, e.g. 0.8 and 0.2.

Because all neurons have continuous transfer functions, it is possible 
to compute the derivative of this error Js; with respect to the weights. 
The weights can then be updated using gradient descent. Using the chain 
rule, the updates of vz), are easy to compute: 


Ns /H 
Avpy = -ne =X (gel¥n) — tue) f (>. Vent (why) + rus) f(wpy) 


; n=1 hb=1 
(5.65) 


The derivation of the gradient with respect to wp; is more complicated: 





0 
Aw,; = -N AE 
i Ow, i 
K Ns f /H 
=X Y (89n) — tne) Veo why) vit (>. ve f (why) + ruse) 
k=1 n=1 hb=1 


(5.66) 


For the computation of equation (5.66) many elements of equation
(5.65) can be reused. This also holds when the network contains more
than one hidden layer. When the updates for $v_{k,h}$ are computed first, and
those for $w_{h,i}$ are computed from that, we effectively distribute the error
between the output and the target value over all weights in the network.
We back-propagate the error. The procedure is called back-propagation
training.
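
The update rules (5.65) and (5.66) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the PRTools implementation: the function names, the choice of the sigmoid as transfer function f, the convention that every input vector carries a trailing constant 1, the batch update and the toy data are all made up for the example.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(W, V, y):
    """Hidden activations and outputs g_k(y) of (5.63).
    W[h] is the weight vector of hidden neuron h; V[k] holds the H weights
    of output neuron k followed by its bias v_{k,H+1}."""
    hidden = [sigmoid(sum(wi * yi for wi, yi in zip(w, y))) for w in W]
    outputs = [sigmoid(sum(v[h] * hidden[h] for h in range(len(hidden))) + v[-1])
               for v in V]
    return hidden, outputs

def sse(W, V, samples, targets):
    """Sum of squared errors J_SE of (5.64)."""
    total = 0.0
    for y, t in zip(samples, targets):
        _, g = forward(W, V, y)
        total += sum((gk - tk) ** 2 for gk, tk in zip(g, t))
    return 0.5 * total

def backprop_epoch(W, V, samples, targets, eta):
    """One batch gradient-descent step over all samples, in place."""
    H, K, N = len(W), len(V), len(W[0])
    dV = [[0.0] * (H + 1) for _ in range(K)]
    dW = [[0.0] * N for _ in range(H)]
    for y, t in zip(samples, targets):
        hidden, g = forward(W, V, y)
        # (g_k - t_k) * f'(.) with f' = f (1 - f) for the sigmoid
        delta_out = [(g[k] - t[k]) * g[k] * (1.0 - g[k]) for k in range(K)]
        for k in range(K):
            for h in range(H):
                dV[k][h] += delta_out[k] * hidden[h]      # gradient of (5.65)
            dV[k][H] += delta_out[k]                      # bias term
        for h in range(H):
            # The output deltas are reused: the error is propagated backwards.
            back = sum(delta_out[k] * V[k][h] for k in range(K))
            delta_h = back * hidden[h] * (1.0 - hidden[h])
            for i in range(N):
                dW[h][i] += delta_h * y[i]                # gradient of (5.66)
    for k in range(K):
        for h in range(H + 1):
            V[k][h] -= eta * dV[k][h]
    for h in range(H):
        for i in range(N):
            W[h][i] -= eta * dW[h][i]

# Toy problem (hypothetical data): two inputs plus a constant 1, one output,
# with the easier-to-reach targets 0.8 and 0.2 mentioned in the text.
random.seed(0)
samples = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
targets = [[0.2], [0.8], [0.8], [0.8]]
W = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(3)]
V = [[random.uniform(-0.5, 0.5) for _ in range(4)]]
j_before = sse(W, V, samples, targets)
for _ in range(200):
    backprop_epoch(W, V, samples, targets, eta=0.1)
j_after = sse(W, V, samples, targets)
```

Note how `delta_out` is computed once per sample and then used for both weight layers; this reuse is what makes the procedure cheap compared with differentiating every weight separately.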

The number of hidden neurons and hidden layers in a neural network
controls how nonlinear the decision boundary can be. Unfortunately, it
is hard to predict which number of hidden neurons is suited for the task
at hand. In practice, we often train a number of neural networks of
varying complexity and compare their performance on an independent
validation set. The danger of finding an overly nonlinear decision boundary
is illustrated in Example 5.7.


Example 5.7 Classification of mechanical parts, neural networks 

Figure 5.13 shows the decision boundaries found by training two
neural networks. The first network, whose decision boundary is
shown in Figure 5.13(a), contains one hidden layer of five units. This
gives a reasonably smooth decision boundary. For the decision function
shown in Figure 5.13(c), a network was used with two hidden
layers of 100 units each. This network has clearly found a highly
nonlinear solution, which does not generalize as well as the first
network. For example, the ‘*’ region (nuts) contains one outlying ‘x’
sample (scrap). The decision boundary bends heavily to include the
single ‘x’ sample within the scrap region. Although such a crumpled
curve decreases the squared error, it is undesirable because the outlying
‘x’ is not likely to occur again at that same location in other realizations.

Note also the spurious region in the bottom right of the plot, in
which samples are classified as ‘bolt’ (denoted by + in the scatterplot).
Here too, the network generalizes poorly as it has not seen any
examples in this region.

Figures 5.13(b) and (d) show the learn curves that were derived
while training the networks. One epoch is a training period in which
the algorithm has cycled through all training samples. The figures
show ‘error’, which is the fraction of the training samples that are
erroneously classified, and ‘mse’, which is $2J_{SE}/(KN_S)$. The larger
the network is, the more epochs are needed before the minimum is
reached. However, sometimes it is better to stop training before
actually reaching the minimum because the generalization ability can
degenerate in the vicinity of the minimum.

The figures were generated by the code shown in Listing 5.8. 


Listing 5.8
PRTools code for training and plotting two neural network classifiers.

load nutsbolts;                                % Load the dataset
[w,R] = bpxnc(z,5,500);                        % Train a small network
figure; scatterd(z); plotc(w);                 % Plot the classifier
figure; plotyy(R(:,1),R(:,2),R(:,1),R(:,4));   % Plot the learn curves
[w,R] = bpxnc(z,[100 100],1000);               % Train a larger network
figure; scatterd(z); plotc(w);                 % Plot the classifier
figure; plotyy(R(:,1),R(:,2),R(:,1),R(:,4));   % Plot the learn curves


5.4 EMPIRICAL EVALUATION 


In the preceding sections various methods for training a classifier have
been discussed. These methods have led to different types of classifiers
and different types of learning rules. However, none of these methods
can claim overall superiority over the others because their applicability
and effectiveness are largely determined by the specific nature of the
problem at hand. Therefore, rather than relying on just one method that
has been selected at the beginning of the design process, the designer
often examines various methods and selects the one that appears most
suitable. For that purpose, each classifier has to be evaluated.

Another reason for performance evaluation stems from the fact that 
many classifiers have their own parameters that need to be tuned. The 
optimization of a design criterion using only training data holds the risk 
of overfitting the design, leading to an inadequate ability to generalize. 
The behaviour of the classifier becomes too specific for the training data 
at hand, and is less appropriate for future measurement vectors coming 
from the same application. Particularly, if there are many parameters 
relative to the size of the training set and the dimension of the measure- 
ment vector, the risk of overfitting becomes large (see also Figure 5.13 
and Chapter 6). Performance evaluation based on a validation set (test 
set, evaluation set), independent from the training set, can be used as a 
stopping criterion for the parameter tuning process. 

A third motivation for performance evaluation is that we would like 
to have reliable specifications of the design anyhow. 

There are many criteria for the performance of a classifier. The prob- 
ability of misclassification, i.e. the error rate, is the most popular one. The 
analytical expression for the error rate as given in (2.16) is not very useful 
because, in practice, the conditional probability densities are unknown. 
However, we can easily obtain an estimate of the error rate by subjecting 
the classifier to a validation set. The estimated error rate is the fraction of 
misclassified samples with respect to the size of the validation set. 
Figure 5.13 Application of two neural networks. (a) One hidden layer of five units.
(b) Learn curves of (a). (c) Two hidden layers of 100 units each. (d) Learn curves of (c)


Where the classification uses the reject option we need to consider two 
measures: the error rate and the reject rate. These two measures are not 
fully independent because lowering the reject rate will increase the error 
rate. A plot of the error rate versus the reject rate visualizes this dependency. 

The error rate itself is somewhat crude as a performance measure.
In fact, it merely suffices in the case of a uniform cost function. A more
profound insight into the behaviour of the classifier is revealed by the
so-called confusion matrix. The i, j-th element of this matrix is the count
of $\omega_i$ samples in the validation set to which class $\omega_j$ is assigned. The
corresponding PRTools function is confmat().
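
To make the definition concrete, the following Python sketch counts a confusion matrix and derives the estimated error rate from it. It is an illustration only and not the PRTools `confmat` routine; the function names and the eight labelled samples are hypothetical.

```python
def confusion_matrix(true_labels, assigned_labels, classes):
    """Element [i][j] counts samples of class i that were assigned to class j."""
    index = {c: i for i, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, a in zip(true_labels, assigned_labels):
        counts[index[t]][index[a]] += 1
    return counts

def error_rate(counts):
    """Fraction of misclassified samples: everything off the diagonal."""
    total = sum(sum(row) for row in counts)
    correct = sum(counts[i][i] for i in range(len(counts)))
    return (total - correct) / total

# Hypothetical validation set of eight samples.
classes = ["bolt", "nut", "ring", "scrap"]
true_labels = ["bolt", "bolt", "nut", "nut", "nut", "scrap", "scrap", "ring"]
assigned = ["bolt", "nut", "nut", "nut", "scrap", "scrap", "scrap", "ring"]
cm = confusion_matrix(true_labels, assigned, classes)
e_hat = error_rate(cm)
```

The off-diagonal entries show which classes are confused with which, information that the single number `e_hat` hides.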

Another design criterion might be the computational complexity
of a classifier. From an engineering point of view both the processor
capability and the storage capability are subject to constraints. At the
same time, the application may limit the computational time needed for
the classification. Often, a trade-off must be found between computational
complexity and performance. In order to do so, we have to assess
the number of operations per classification and the storage requirements.
These aspects scale with the number of elements $N$ in the
measurement vector, the number of classes $K$, and for some classifiers
the size $N_S$ of the training set. For instance, the number of operations of
a quadratic machine is of the order of $KN^2$, and so is its memory
requirement. For binary measurements, the storage requirement is of
the order of $K2^N$ (which becomes prohibitive even for moderate $N$).

Often, the acquisition of labelled samples is laborious and costly. 
Therefore, it is tempting to use the training set also for evaluation 
purposes. However, the strategy of ‘testing from training data’ disguises 
the phenomenon of overfitting. The estimated error rate is likely to be 
strongly overoptimistic. To prevent this, the validation set should be 
independent of the training set. 

A straightforward way to accomplish this is the so-called holdout 
method. The available samples are randomly partitioned into a training 
set and a validation set. Because the validation set is used to estimate 
only one parameter, i.e. the error rate, and the training set is used to 
tune all other parameters of the classifier (which may be quite numer- 
ous), the training set must be larger than the validation set. As a rule of 
thumb, the validation set contains about 20% of the data, and the 
training set the remaining 80%. 
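
A holdout split along these lines might look as follows in Python. The sketch assumes that the samples fit in a list and that a plain random shuffle is acceptable (no stratification per class); the function name, its parameters and the stand-in data are hypothetical.

```python
import random

def holdout_split(samples, validation_fraction=0.2, seed=None):
    """Randomly partition samples into (training set, validation set)."""
    rng = random.Random(seed)
    shuffled = list(samples)          # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_val = round(validation_fraction * len(shuffled))
    return shuffled[n_val:], shuffled[:n_val]

data = list(range(100))               # stand-in for 100 labelled samples
train, val = holdout_split(data, validation_fraction=0.2, seed=1)
```

Fixing the seed makes the partition reproducible, which helps when several classifiers are to be compared on the same validation set.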

Suppose that a validation set consisting of, say, $N_{Test}$ labelled samples
is available. Application of the ‘classifier-under-test’ to this validation
set results in $n_{error}$ misclassified samples. If the true error rate of the
classifier is denoted by $E$, then the estimated error rate is:

$$\hat{E} = \frac{n_{error}}{N_{Test}} \qquad (5.67)$$

The random variable $n_{error}$ has a binomial distribution with parameters
$E$ and $N_{Test}$. Therefore, the expectation of $\hat{E}$ and its standard deviation are
found as:

$$\mathrm{E}[\hat{E}] = \frac{\mathrm{E}[n_{error}]}{N_{Test}} = \frac{E N_{Test}}{N_{Test}} = E$$

$$\sigma_{\hat{E}} = \frac{\sqrt{\mathrm{Var}[n_{error}]}}{N_{Test}} = \frac{\sqrt{N_{Test}(1-E)E}}{N_{Test}} = \sqrt{\frac{(1-E)E}{N_{Test}}} \qquad (5.68)$$

Hence, $\hat{E}$ is an unbiased, consistent estimate of $E$.
Usually, the required number of samples $N_{Test}$ is defined such that $\hat{E}$ is
within some specified margin around the true $E$ with a prescribed probability.
It can be calculated from the posterior probability density
$p(E|n_{error}, N_{Test})$; see exercise 5. However, we take our ease by simply
requiring that the relative uncertainty $\sigma_{\hat{E}}/E$ is equal to some fixed
fraction $\gamma$. Substituting this in (5.68) and solving for $N_{Test}$, we obtain:

$$N_{Test} = \frac{1-E}{\gamma^2 E} \qquad (5.69)$$


Figure 5.14 shows the required number of samples for different values of 
E such that the relative uncertainty is 10%. The figure shows that with 
E = 0.01 the number of samples must be about 10 000. 
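
The figure quoted above follows directly from (5.68) and (5.69), as the short Python sketch below shows. The function names are hypothetical; the formulas are those of the text.

```python
import math

def error_estimate_std(E, n_test):
    """Standard deviation of the estimated error rate, equation (5.68)."""
    return math.sqrt((1.0 - E) * E / n_test)

def required_test_samples(E, gamma):
    """Validation set size giving relative uncertainty gamma, equation (5.69)."""
    return (1.0 - E) / (gamma ** 2 * E)

n_needed = required_test_samples(E=0.01, gamma=0.1)     # close to 9900
relative = error_estimate_std(0.01, n_needed) / 0.01    # back to about 0.1
```

Because $N_{Test}$ grows roughly as $1/E$ for small $E$, a classifier that is ten times better needs about ten times as many validation samples to be specified with the same relative accuracy.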

The holdout method is not economic in handling the available data 
because only part of that data is used for training. A classifier trained 
with a reduced training set is expected to be inferior to a classifier 
trained with all available data. Particularly if the acquisition of labelled 
data is expensive, we prefer methods that use the data as much as 
possible for training and yet give unbiased estimates for the error rate. 
Examples of methods in this category are the cross-validation method 
and the leave-one-out method. These methods are computationally 
expensive, because they require the design of many classifiers. 

The cross-validation method randomly partitions the available data
into $L$ equally sized subsets $T(\ell)$, $\ell = 1, \ldots, L$. First, the subset $T(1)$ is
withheld and a classifier is trained using all the data from the remaining
$L-1$ subsets. This classifier is tested using $T(1)$ as validation data,
yielding an estimate $\hat{E}(1)$. This procedure is repeated for all other subsets,
thus resulting in $L$ estimates $\hat{E}(\ell)$. The last step is the training of the
final classifier using all available data. Its error rate is estimated as the
average over $\hat{E}(\ell)$. This estimate is a little pessimistic, especially if $L$ is small.

Figure 5.14 Required number of test samples

The leave-one-out method is essentially the same as the cross- 
validation method except that now L is as large as possible, i.e. equal 
to Ns (the number of available samples). The bias of the leave-one-out 
method is negligible, especially if the training set is large. However, it 
requires the training of Ns + 1 classifiers and it is computationally very 
expensive if Ns is large. 
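
Both procedures can be sketched with one Python routine, since leave-one-out is cross-validation with $L = N_S$. The nearest-mean classifier used here is only a stand-in to keep the example self-contained; it, the function names and the nine toy samples (including one deliberately outlying ‘nut’) are assumptions made for the illustration.

```python
import random

def train_nearest_mean(samples, labels):
    """Return the sample mean of each class as a dict label -> mean vector."""
    sums, counts = {}, {}
    for z, c in zip(samples, labels):
        acc = sums.setdefault(c, [0.0] * len(z))
        for i, zi in enumerate(z):
            acc[i] += zi
        counts[c] = counts.get(c, 0) + 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}

def classify(means, z):
    """Assign z to the class with the closest mean (squared Euclidean distance)."""
    return min(means, key=lambda c: sum((zi - mi) ** 2 for zi, mi in zip(z, means[c])))

def cross_validation_error(samples, labels, L, seed=0):
    """Average of the L error estimates obtained by withholding one subset at a time."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    folds = [order[l::L] for l in range(L)]       # (nearly) equally sized subsets
    estimates = []
    for fold in folds:
        held_out = set(fold)
        train_idx = [n for n in order if n not in held_out]
        means = train_nearest_mean([samples[n] for n in train_idx],
                                   [labels[n] for n in train_idx])
        errors = sum(classify(means, samples[n]) != labels[n] for n in fold)
        estimates.append(errors / len(fold))
    return sum(estimates) / L

samples = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [2.0, 2.0],
           [2.0, 2.1], [2.2, 1.9], [1.9, 2.3], [2.1, 2.0]]
labels = ["nut"] * 5 + ["bolt"] * 4               # fifth sample is an outlying nut
e_cv = cross_validation_error(samples, labels, L=3)
e_loo = cross_validation_error(samples, labels, L=len(samples))
```

The routine trains $L$ classifiers for the estimate; the final classifier trained on all data is the additional one that brings the leave-one-out count to $N_S + 1$.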
5.6 EXERCISES


1. Prove that if $\mathbf{C}$ is the covariance matrix of a random vector $\mathbf{z}$, then $\frac{1}{N}\mathbf{C}$
is the covariance matrix of the average of $N$ realizations of $\mathbf{z}$. (0)

2. Show that (5.16) and (5.17) are equivalent. (*)

3. Prove that, for the two-class case, (5.50) is equivalent to the Fisher linear discriminant
(6.52). (**)

4. Investigate the behaviour (bias and variance) of the estimators for the conditional
probabilities of binary measurements, i.e. (5.20) and (5.22), at the extreme ends. That
is, if $N_k \ll 2^N$ and $N_k \gg 2^N$. (**)

5. Assuming that the prior probability density of the error rate of a classifier is uniform
between 0 and 1/K, give an expression of the posterior density $p(E|n_{error}, N_{Test})$
where $N_{Test}$ is the size of an independent validation set and $n_{error}$ is the number of
misclassifications of the classifier. (+)

6. Derive the dual formulation of the support vector classifier from the primal formula- 
tion. Do this by setting the partial derivatives of L to zero, and substituting the results 
in the primal function. (+) 

7. Show that the support vector classifier with slack variables gives almost the same
dual formulation as the one without slack variables (5.56). (*)

8. Derive the neural network weight update rules (5.65) and (5.66). (*)

9. Neural network weights are often initialized to random values in a small range, e.g.
(−0.01, 0.01). As training progresses, the weight values quickly increase. However,
the support vector classifier tells us that solutions with small norms of the
weight vector have high generalization capability. What would be a simple way to
assure that the network does not become too nonlinear? (*)

10. Given the answer to exercise 9, what will be the effect of using better optimization
techniques (such as second-order algorithms) in neural network training? Validate
this experimentally using the PRTools lmnc function. (+)


6 Feature Extraction and Selection


In some cases, the dimension $N$ of a measurement vector $\mathbf{z}$, i.e. the
number of sensors, can be very high. In image processing, when raw
image data is used directly as the input for a classification, the dimension
can easily attain values of $10^4$ (a 100 × 100 image) or more. Many
elements of $\mathbf{z}$ can be redundant or even irrelevant with respect to the
classification process.

For two reasons, the dimension of the measurement vector cannot be
taken arbitrarily large. The first reason is that the computational complexity
becomes too large. A linear classification machine requires in the
order of $KN$ operations ($K$ is the number of classes; see Chapter 2).
A quadratic machine needs about $KN^2$ operations. For a machine acting
on binary measurements the memory requirement is on the order of
$K2^N$. This, together with the required throughput (number of classifications
per second), the state of the art in computer technology and the
available budget define an upper bound to $N$.

A second reason is that an increase of the dimension ultimately causes
a decrease of performance. Figure 6.1 illustrates this. Here, we have a
measurement space with the dimension $N$ varying between 1 and 13.
There are two classes ($K = 2$) with equal prior probabilities. The (true)
minimum error rate $E_{min}$ is the one which would be obtained if all class
densities of the problem were fully known. Clearly, the minimum error
rate is a non-increasing function of the number of sensors. Once an
element has been added with discriminatory information, the addition
of another element cannot destroy this information. Therefore, with
growing dimension, class information accumulates.

Figure 6.1 Error rates versus dimension of measurement space

However, in practice the densities are seldom completely known.
Often, the classifiers have to be designed using a (finite) training set instead
of using knowledge about the densities. In the example of Figure 6.1
the measurement data is binary. The number of states a vector can take
is $2^N$. If there are no constraints on the conditional probabilities, then
the number of parameters to estimate is in the order of $2^N$. The number
of samples in the training set must be much larger than this. If not,
overfitting occurs and the trained classifier will become too much
adapted to the noise in the training data. Figure 6.1 shows that if the
size of the training set is $N_S = 20$, the optimal dimension of the measurement
vector is about $N = 4$; that is where the error rate $E$ is lowest.
Increasing the sample size permits an increase of the dimension. With
$N_S = 80$ the optimal dimension is about $N = 6$.

One strategy to prevent overfitting, or at least to reduce its effect, has
already been discussed in Chapter 5: incorporating more prior knowledge
by restricting the structure of the classifier (for instance, by an
appropriate choice of the discriminant function). In the current chapter,
an alternative strategy will be discussed: the reduction of the dimension
of the measurement vector. An additional advantage of this strategy is
that it automatically reduces the computational complexity.

For the reduction of the measurement space, two different approaches 
exist. One is to discard certain elements of the vector and to select the 
ones that remain. This type of reduction is feature selection. It is dis- 
cussed in Section 6.2. The other approach is feature extraction. Here, the 
selection of elements takes place in a transformed measurement space. 
Section 6.3 addresses the problem of how to find suitable transforms. 
Both methods rely on the availability of optimization criteria. These are 
discussed in Section 6.1. 


6.1 CRITERIA FOR SELECTION AND EXTRACTION 


The first step in the design of optimal feature selectors and feature 
extractors is to define a quantitative criterion that expresses how well 
such a selector or extractor performs. The second step is to do the actual 
optimization, i.e. to use that criterion to find the selector/extractor that 
performs best. Such an optimization can be performed either analytically 
or numerically. 

Within a Bayesian framework ‘best’ means the one with minimal risk. 
Often, the cost of misclassification is difficult to assess, or even fully 
unknown. Therefore, as an optimization criterion the risk is often 
replaced by the error rate E. Techniques to assess the error rate empiric- 
ally by means of a validation set are discussed in Section 5.4. However, 
in this section we need to be able to manipulate the criterion mathemat- 
ically. Unfortunately, the mathematical structure of the error rate is 
complex. The current section introduces some alternative, approximate 
criteria that are simple enough for a mathematical treatment. 

In feature selection and feature extraction, these simple criteria are 
used as alternative performance measures. Preferably, such performance 
measures have the following properties: 


• The measure increases as the average distance between the expectation
  vectors of different classes increases. This property is based
  on the assumption that the class information of a measurement
  vector is mainly in the differences between the class-dependent
  expectations.

• The measure decreases with increasing noise scattering. This property
  is based on the assumption that the noise on a measurement
  vector does not contain class information, and that the class distinction
  is obfuscated by this noise.

• The measure is invariant to reversible linear transforms. Suppose
  that the measurement space is transformed to a feature space, i.e.
  $\mathbf{y} = \mathbf{A}\mathbf{z}$ with $\mathbf{A}$ an invertible matrix, then the measure expressed
  in the $\mathbf{y}$ space should be exactly the same as the one expressed in
  the $\mathbf{z}$ space. This property is based on the fact that both spaces carry
  the same class information.

• The measure is simple to manipulate mathematically. Preferably,
  the derivatives of the criterion are easy to obtain, since it is used as an
  optimization criterion.


From the various measures known in literature (Devijver and Kittler, 1982), 
two will be discussed. One of them — the interclass/intraclass distance 
(Section 6.1.1) — applies to the multi-class case. It is useful if class informa- 
tion is mainly found in the differences between expectation vectors in the 
measurement space, while at the same time the scattering of the measure- 
ment vectors (due to noise) is class-independent. The second measure — the 
Chernoff distance (Section 6.1.2) — is particularly useful in the two-class 
case because it can then be used to express bounds on the error rate. 

Section 6.1.3 concludes with an overview of some other performance 
measures. 


6.1.1 Inter/intra class distance 


The inter/intra distance measure is based on the Euclidean distance 
between pairs of samples in the training set. We assume that the class- 
dependent distributions are such that the expectation vectors of the 
different classes are discriminating. If fluctuations of the measurement 
vectors around these expectations are due to noise, then these fluctu- 
ations will not carry any class information. Therefore, our goal is to 
arrive at a measure that is a monotonically increasing function of the 
distance between expectation vectors, and a monotonically decreasing 
function of the scattering around the expectations. 

As in Chapter 5, $T_S$ is a (labelled) training set with $N_S$ samples. The
classes $\omega_k$ are represented by subsets $T_k \subset T_S$, each class having $N_k$
samples ($\sum N_k = N_S$). Measurement vectors in $T_S$, without reference
to their class, are denoted by $\mathbf{z}_n$. Measurement vectors in $T_k$ (i.e. vectors
coming from class $\omega_k$) are denoted by $\mathbf{z}_{k,n}$.

The development starts with the following definition of the average
squared distance of pairs of samples in the training set:

$$\bar{\rho}^2 = \frac{1}{N_S^2} \sum_{n=1}^{N_S} \sum_{m=1}^{N_S} (\mathbf{z}_n - \mathbf{z}_m)^T (\mathbf{z}_n - \mathbf{z}_m) \qquad (6.1)$$

The summand $(\mathbf{z}_n - \mathbf{z}_m)^T (\mathbf{z}_n - \mathbf{z}_m)$ is the squared Euclidean distance
between a pair of samples. The sum involves all pairs of samples in the
training set. The divisor $N_S^2$ accounts for the number of terms.

The distance $\bar{\rho}^2$ is useless as a performance measure because none of
the desired properties mentioned above are met. Moreover, $\bar{\rho}^2$ is defined
without any reference to the labels of the samples. Thus, it does not give
any clue about how well the classes in the training set can be discriminated.
To correct this, $\bar{\rho}^2$ must be divided into a part describing the
average distance between expectation vectors and a part describing
distances due to noise scattering. For that purpose, estimations of the
conditional expectations ($\boldsymbol{\mu}_k = \mathrm{E}[\mathbf{z}|\omega_k]$) of the measurement vectors are
used, along with an estimate of the unconditional expectation
($\boldsymbol{\mu} = \mathrm{E}[\mathbf{z}]$). The sample mean of class $\omega_k$ is:

$$\hat{\boldsymbol{\mu}}_k = \frac{1}{N_k} \sum_{n=1}^{N_k} \mathbf{z}_{k,n} \qquad (6.2)$$

The sample mean of the entire training set is:

$$\hat{\boldsymbol{\mu}} = \frac{1}{N_S} \sum_{n=1}^{N_S} \mathbf{z}_n \qquad (6.3)$$


With these definitions, it can be shown (see exercise 1) that the average 
squared distance is: 


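$$\bar{\rho}^2 = \frac{2}{N_S} \sum_{k=1}^{K} \sum_{n=1}^{N_k} \left[ (\mathbf{z}_{k,n} - \hat{\boldsymbol{\mu}}_k)^T (\mathbf{z}_{k,n} - \hat{\boldsymbol{\mu}}_k) + (\hat{\boldsymbol{\mu}} - \hat{\boldsymbol{\mu}}_k)^T (\hat{\boldsymbol{\mu}} - \hat{\boldsymbol{\mu}}_k) \right] \qquad (6.4)$$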


The first term represents the average squared distance due to the scatter- 
ing of samples around their class-dependent expectation. The second 
term corresponds to the average squared distance between class- 
dependent expectations and the unconditional expectation. 
An alternative way to represent these distances is by means of scatter
matrices. A scatter matrix gives some information about the dispersion
of a population of samples around their mean. For instance, the matrix
that describes the scattering of vectors from class $\omega_k$ is:

$$\mathbf{S}_k = \frac{1}{N_k} \sum_{n=1}^{N_k} (\mathbf{z}_{k,n} - \hat{\boldsymbol{\mu}}_k)(\mathbf{z}_{k,n} - \hat{\boldsymbol{\mu}}_k)^T \qquad (6.5)$$


Comparison with equation (5.14) shows that $\mathbf{S}_k$ is close to an unbiased
estimate of the class-dependent covariance matrix. In fact, $\mathbf{S}_k$ is the
maximum likelihood estimate of $\mathbf{C}_k$. With that, $\mathbf{S}_k$ does not only supply
information about the average distance of the scattering, it also supplies
information about the eccentricity and orientation of this scattering.
This is analogous to the properties of a covariance matrix.

Averaged over all classes the scatter matrix describing the noise is: 


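$$\mathbf{S}_w = \frac{1}{N_S} \sum_{k=1}^{K} N_k \mathbf{S}_k = \frac{1}{N_S} \sum_{k=1}^{K} \sum_{n=1}^{N_k} (\mathbf{z}_{k,n} - \hat{\boldsymbol{\mu}}_k)(\mathbf{z}_{k,n} - \hat{\boldsymbol{\mu}}_k)^T \qquad (6.6)$$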


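This matrix is the within-scatter matrix as it describes the average
scattering within classes. Complementary to this is the between-scatter
matrix $\mathbf{S}_b$ that describes the scattering of the class-dependent sample
means around the overall average:

$$\mathbf{S}_b = \frac{1}{N_S} \sum_{k=1}^{K} N_k (\hat{\boldsymbol{\mu}}_k - \hat{\boldsymbol{\mu}})(\hat{\boldsymbol{\mu}}_k - \hat{\boldsymbol{\mu}})^T \qquad (6.7)$$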


Figure 6.2 illustrates the concepts of within-scatter matrices and
between-scatter matrices. The figure shows a scatter diagram of a training
set consisting of four classes. A scatter matrix $\mathbf{S}$ corresponds to an
ellipse, $\mathbf{z}^T\mathbf{S}^{-1}\mathbf{z} = 1$, that can be thought of as a contour roughly
surrounding the associated population of samples. Of course, strictly speaking
the correspondence holds true only if the underlying probability
density is Gaussian-like. But even if the densities are not Gaussian, the
ellipses give an impression of how the population is scattered. In the
scatter diagram in Figure 6.2 the within-scatter $\mathbf{S}_w$ is represented by four
similar ellipses positioned at the four conditional sample means. The
between-scatter $\mathbf{S}_b$ is depicted by an ellipse centred at the mixture sample
mean.
Figure 6.2 The inter/intra class distance


With the definitions in (6.5), (6.6) and (6.7) the average squared distance
in (6.4) is proportional to the trace of the matrix $\mathbf{S}_w + \mathbf{S}_b$; see (b.22):


$$\bar{\rho}^2 = 2\,\mathrm{trace}(\mathbf{S}_w + \mathbf{S}_b) = 2\,\mathrm{trace}(\mathbf{S}_w) + 2\,\mathrm{trace}(\mathbf{S}_b) \qquad (6.8)$$


Indeed, this expression shows that the average distance is composed of a
contribution due to differences in expectation and a contribution due to
noise. The term $J_{INTRA} = \mathrm{trace}(\mathbf{S}_w)$ is called the intraclass distance. The
term $J_{INTER} = \mathrm{trace}(\mathbf{S}_b)$ is the interclass distance. Equation (6.8) also
shows that the average distance is not appropriate as a performance
measure, since a large value of $\bar{\rho}^2$ does not imply that the classes are well
separated, and vice versa.
A performance measure more suited to express the separability of
classes is the ratio between interclass and intraclass distance:

$$\frac{J_{INTER}}{J_{INTRA}} = \frac{\mathrm{trace}(\mathbf{S}_b)}{\mathrm{trace}(\mathbf{S}_w)} \qquad (6.9)$$


This measure possesses some of the desired properties of a performance
measure. In Figure 6.2, the numerator, $\mathrm{trace}(\mathbf{S}_b)$, measures the area of
the ellipse associated with $\mathbf{S}_b$. As such, it measures the fluctuations of the
conditional expectations around the overall expectation, i.e. the fluctuations
of the ‘signal’. The denominator, $\mathrm{trace}(\mathbf{S}_w)$, measures the area of
the ellipse associated with $\mathbf{S}_w$. As such, it measures the fluctuations due
to noise. Therefore, $\mathrm{trace}(\mathbf{S}_b)/\mathrm{trace}(\mathbf{S}_w)$ can be regarded as a
‘signal-to-noise ratio’.

Unfortunately, the measure of (6.9) overlooks the fact that the ellipse
associated with the noise can be quite large, but without having a large
intersection with the ellipse associated with the signal. A large $\mathbf{S}_w$ can be
quite harmless for the separability of the training set. One way to correct
this defect is to transform the measurement space such that the
within-scattering becomes white, i.e. $\mathbf{S}_w = \mathbf{I}$. For that purpose, we apply a linear
operation to all measurement vectors yielding feature vectors $\mathbf{y}_n = \mathbf{A}\mathbf{z}_n$.
In the transformed space the within- and between-scatter matrices
become $\mathbf{A}\mathbf{S}_w\mathbf{A}^T$ and $\mathbf{A}\mathbf{S}_b\mathbf{A}^T$, respectively. The matrix $\mathbf{A}$ is chosen such
that $\mathbf{A}\mathbf{S}_w\mathbf{A}^T = \mathbf{I}$.

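The matrix $\mathbf{A}$ can be found by factorization: $\mathbf{S}_w = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^T$ where $\boldsymbol{\Lambda}$ is a
diagonal matrix containing the eigenvalues of $\mathbf{S}_w$, and $\mathbf{V}$ a unitary matrix
containing the corresponding eigenvectors; see appendix B.5. With this
factorization it follows that $\mathbf{A} = \boldsymbol{\Lambda}^{-1/2}\mathbf{V}^T$. An illustration of the process
is depicted in Figure 6.2. The operation $\mathbf{V}^T$ performs a rotation that
aligns the axes of $\mathbf{S}_w$. It decorrelates the noise. The operation $\boldsymbol{\Lambda}^{-1/2}$ scales
the axes. The normalized within-scatter matrix corresponds with a circle
with unit radius. In this transformed space, the area of the ellipse
associated with the between-scatter is a usable performance measure:


$$\begin{aligned} J_{INTER/INTRA} &= \mathrm{trace}(\boldsymbol{\Lambda}^{-1/2}\mathbf{V}^T\mathbf{S}_b\mathbf{V}\boldsymbol{\Lambda}^{-1/2}) \\ &= \mathrm{trace}(\mathbf{V}\boldsymbol{\Lambda}^{-1}\mathbf{V}^T\mathbf{S}_b) \\ &= \mathrm{trace}(\mathbf{S}_w^{-1}\mathbf{S}_b) \end{aligned} \qquad (6.10)$$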


This performance measure is called the inter/intra distance. It meets all 
of our requirements stated above. 
Example 6.1 Interclass and intraclass distance

Numerical calculations of the example in Figure 6.2 show that before
normalization $J_{INTRA} = \mathrm{trace}(\mathbf{S}_w) = 0.016$ and $J_{INTER} = \mathrm{trace}(\mathbf{S}_b) = 0.54$.
Hence, the ratio between these two is $J_{INTER}/J_{INTRA} = 33.8$.
After normalization, $J_{INTRA} = N = 2$, $J_{INTER} = J_{INTER/INTRA} = 59.1$,
and $J_{INTER}/J_{INTRA} = \mathrm{trace}(\mathbf{S}_b)/\mathrm{trace}(\mathbf{S}_w) = 29.6$. In this example,
before normalization, the $J_{INTER}/J_{INTRA}$ measure is too optimistic.
The normalization accounts for this phenomenon.
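
The quantities of this section can be computed with a few lines of Python. The sketch below is restricted to two-dimensional measurement vectors so that the inverse of $\mathbf{S}_w$ can be written out by hand; the function names, the two made-up classes and the transform matrix `A` are assumptions for the illustration and are unrelated to the data of Figure 6.2.

```python
def mean(vectors):
    """Sample mean of a list of equally long vectors, as in (6.2) and (6.3)."""
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(len(vectors[0]))]

def scatter_matrices(classes):
    """Return (S_w, S_b) of (6.6) and (6.7); classes maps a label to its vectors."""
    everything = [z for vectors in classes.values() for z in vectors]
    n_s, dim = len(everything), len(everything[0])
    mu = mean(everything)
    s_w = [[0.0] * dim for _ in range(dim)]
    s_b = [[0.0] * dim for _ in range(dim)]
    for vectors in classes.values():
        mu_k = mean(vectors)
        for z in vectors:
            d = [a - b for a, b in zip(z, mu_k)]
            for i in range(dim):
                for j in range(dim):
                    s_w[i][j] += d[i] * d[j] / n_s
        d = [a - b for a, b in zip(mu_k, mu)]
        for i in range(dim):
            for j in range(dim):
                s_b[i][j] += len(vectors) * d[i] * d[j] / n_s
    return s_w, s_b

def trace(m):
    return sum(m[i][i] for i in range(len(m)))

def inter_intra_distance(s_w, s_b):
    """trace(S_w^{-1} S_b) of (6.10), written out for the 2-D case only."""
    (a, b), (c, d) = s_w
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    return sum(inv[i][j] * s_b[j][i] for i in range(2) for j in range(2))

def average_squared_distance(vectors):
    """Average squared Euclidean distance over all pairs of samples, (6.1)."""
    n = len(vectors)
    return sum(sum((a - b) ** 2 for a, b in zip(zn, zm))
               for zn in vectors for zm in vectors) / n ** 2

def transform(classes, A):
    """Apply y = A z to every vector of every class."""
    return {label: [[sum(A[i][j] * z[j] for j in range(len(z))) for i in range(len(A))]
                    for z in vectors]
            for label, vectors in classes.items()}

classes = {"nut": [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
           "bolt": [[3.0, 2.0], [4.0, 2.0], [3.0, 3.0], [4.0, 4.0]]}
s_w, s_b = scatter_matrices(classes)
rho2 = average_squared_distance([z for vs in classes.values() for z in vs])
j_z = inter_intra_distance(s_w, s_b)
j_y = inter_intra_distance(*scatter_matrices(transform(classes, [[2.0, 1.0], [0.0, 1.0]])))
```

Comparing `rho2` with `2 * (trace(s_w) + trace(s_b))` checks (6.8) numerically, and comparing `j_z` with `j_y` checks the invariance to invertible linear transforms that motivated (6.10).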


6.1.2 Chernoff-Bhattacharyya distance 


The interclass and intraclass distances are based on the Euclidean 
metric defined in the measurement space. Another possibility is to use 
a metric based on probability densities. Examples in this category are the 
Chernoff distance (Chernoff, 1952) and the Bhattacharyya distance 
(Bhattacharyya, 1943). These distances are especially useful in the two- 
class case. 

The merit of the Bhattacharyya and Chernoff distance is that an
inequality exists with which the distances bound the minimum error
rate $E_{min}$. The inequality is based on the following relationship:


$$\min\{a, b\} \le \sqrt{ab} \qquad (6.11)$$


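The inequality holds true for any positive quantities $a$ and $b$. We will use
it in the expression of the minimum error rate. Substitution of (2.15) in
(2.16) yields:

$$\begin{aligned} E_{min} &= 1 - \int \max\{P(\omega_1|\mathbf{z}), P(\omega_2|\mathbf{z})\}\, p(\mathbf{z})\, d\mathbf{z} \\ &= \int \min\{p(\mathbf{z}|\omega_1)P(\omega_1),\, p(\mathbf{z}|\omega_2)P(\omega_2)\}\, d\mathbf{z} \end{aligned} \qquad (6.12)$$

Together with (6.11) we have the following inequality:

$$E_{min} \le \sqrt{P(\omega_1)P(\omega_2)} \int \sqrt{p(\mathbf{z}|\omega_1)\, p(\mathbf{z}|\omega_2)}\, d\mathbf{z} \qquad (6.13)$$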
The inequality is called the Bhattacharyya upper bound. A more compact
notation of it is achieved with the so-called Bhattacharyya distance.
This performance measure is defined as:

$$J_{BHAT} = -\ln\left[\int \sqrt{p(\mathbf{z}|\omega_1)\, p(\mathbf{z}|\omega_2)}\, d\mathbf{z}\right] \qquad (6.14)$$

With that, the Bhattacharyya upper bound simplifies to:

$$E_{min} \le \sqrt{P(\omega_1)P(\omega_2)}\, \exp(-J_{BHAT}) \qquad (6.15)$$


The bound can be made tighter if inequality (6.11) is replaced with
the more general inequality $\min\{a, b\} \le a^s b^{1-s}$. This last inequality
holds true for any $s$, $a$ and $b$ in the interval [0, 1]. The inequality leads
to the Chernoff distance, defined as:


$$J_C(s) = -\ln\left[\int p^s(\mathbf{z}|\omega_1)\, p^{1-s}(\mathbf{z}|\omega_2)\, d\mathbf{z}\right] \quad \text{with: } 0 \le s \le 1 \qquad (6.16)$$


Application of the Chernoff distance in a derivation similar to (6.12)
yields:


$$E_{min} \le P(\omega_1)^s P(\omega_2)^{1-s} \exp(-J_C(s)) \quad \text{for any } s \in [0, 1] \qquad (6.17)$$


The so-called Chernoff bound encompasses the Bhattacharyya upper
bound. In fact, for $s = 0.5$ the Chernoff distance and the Bhattacharyya
distance are equal: $J_{BHAT} = J_C(0.5)$.

There also exists a lower bound based on the Bhattacharyya distance. 
This bound is expressed as: 


$$\frac{1}{2}\left[1 - \sqrt{1 - 4P(\omega_1)P(\omega_2)\exp(-2J_{BHAT})}\right] \le E_{min} \qquad (6.18)$$





A further simplification occurs when we specify the conditional
probability densities. An important application of the Chernoff and
Bhattacharyya distance is the Gaussian case. Suppose that these densities
have class-dependent expectation vectors $\boldsymbol{\mu}_k$ and covariance matrices $\mathbf{C}_k$,
respectively. Then, it can be shown that the Chernoff distance
transforms into:

$$J_C(s) = \frac{1}{2}s(1-s)(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^T\left[(1-s)\mathbf{C}_1 + s\mathbf{C}_2\right]^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) + \frac{1}{2}\ln\left[\frac{|(1-s)\mathbf{C}_1 + s\mathbf{C}_2|}{|\mathbf{C}_1|^{1-s}\,|\mathbf{C}_2|^{s}}\right] \qquad (6.19)$$





It can be seen that if the covariance matrices are independent of the
classes, i.e. $\mathbf{C}_1 = \mathbf{C}_2$, the second term vanishes, and the Chernoff and the
Bhattacharyya distances become proportional to the Mahalanobis distance
SNR given in (2.46): $J_{BHAT} = SNR/8$. Figure 6.3(a) shows the corresponding
Chernoff and Bhattacharyya upper bounds. In this particular case, the
relation between SNR and the minimum error rate is easily obtained using
expression (2.47). Figure 6.3(a) also shows the Bhattacharyya lower bound.

The dependency of the Chernoff bound on $s$ is depicted in Figure 6.3(b).
If $\mathbf{C}_1 = \mathbf{C}_2$, the Chernoff distance is symmetric in $s$, and the minimum
bound is located at $s = 0.5$ (i.e. the Bhattacharyya upper bound). If the
covariance matrices are not equal, the Chernoff distance is not symmetric,
and the minimum bound is not guaranteed to be located midway. A
numerical optimization procedure can be applied to find the tightest bound.

If in the Gaussian case, the expectation vectors are equal ($\boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$),
the first term of (6.19) vanishes, and all class information is represented
by the second term. This term corresponds to class information carried
by differences in covariance matrices.


Figure 6.3 Error bounds and the true minimum error for the Gaussian case
($C_1 = C_2$). (a) The minimum error rate with some bounds given by the Chernoff
distance. In this example the bound with s = 0.5 (Bhattacharyya upper bound) is the
tightest. The figure also shows the Bhattacharyya lower bound. (b) The Chernoff
bound with dependence on s
6.1.3 Other criteria


The criteria discussed above are certainly not the only ones used in
feature selection and extraction. In fact, large families of performance
measures exist; see Devijver and Kittler (1982) for an extensive overview.
One family is the so-called probabilistic distance measures. These
measures are based on the observation that large differences between the
conditional densities $p(\mathbf{z}|\omega_k)$ result in small error rates. Let us assume a
two-class problem with conditional densities $p(\mathbf{z}|\omega_1)$ and $p(\mathbf{z}|\omega_2)$. Then,
a probabilistic distance takes the following form:


$$J = \int g\big(p(\mathbf{z}|\omega_1),\, p(\mathbf{z}|\omega_2)\big)\,d\mathbf{z} \tag{6.20}$$


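The function $g(\cdot,\cdot)$ must be such that J is zero when $p(\mathbf{z}|\omega_1) = p(\mathbf{z}|\omega_2)$,
$\forall \mathbf{z}$, and non-negative otherwise. In addition, we require that J attains its
maximum whenever the two densities are completely non-overlapping.
Obviously, the Bhattacharyya distance (6.14) and the Chernoff distance
(6.16) are examples of performance measures based on a probabilistic
distance. Other examples are the Matusita measure and the divergence
measures:


$$J_{MATUSITA} = \sqrt{\int \left(\sqrt{p(\mathbf{z}|\omega_1)} - \sqrt{p(\mathbf{z}|\omega_2)}\right)^2 d\mathbf{z}} \tag{6.21}$$

$$J_{DIVERGENCE} = \int \big(p(\mathbf{z}|\omega_1) - p(\mathbf{z}|\omega_2)\big)\ln\frac{p(\mathbf{z}|\omega_1)}{p(\mathbf{z}|\omega_2)}\,d\mathbf{z} \tag{6.22}$$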





These measures are useful for two-class problems. For more classes, a
measure is obtained by taking the average of pairs. That is:


$$J = \sum_{k=1}^{K}\sum_{l=1}^{K} P(\omega_k)\,P(\omega_l)\,J_{k,l} \tag{6.23}$$


where $J_{k,l}$ is the measure found between the classes $\omega_k$ and $\omega_l$.
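As a rough illustration of (6.21) to (6.23), the sketch below evaluates the Matusita and divergence measures on discrete histograms, so the integrals become sums over bins. The histogram representation, the function names and the example numbers are assumptions made for this sketch; the divergence additionally assumes strictly positive bin probabilities.

```python
import math


def matusita(p, q):
    """Discrete analogue of the Matusita measure (6.21)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def divergence(p, q):
    """Discrete analogue of the divergence (6.22); all bins must be > 0."""
    return sum((a - b) * math.log(a / b) for a, b in zip(p, q))


def average_over_pairs(priors, histograms, measure):
    """Prior-weighted average of a two-class measure over all class pairs (6.23)."""
    n_classes = len(priors)
    return sum(
        priors[k] * priors[l] * measure(histograms[k], histograms[l])
        for k in range(n_classes)
        for l in range(n_classes)
    )


# Three hypothetical classes described by three-bin histograms.
histograms = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7], [0.3, 0.4, 0.3]]
priors = [0.5, 0.25, 0.25]

j_matusita = average_over_pairs(priors, histograms, matusita)
j_divergence = average_over_pairs(priors, histograms, divergence)
```

Pairs with k = l contribute zero, because both measures vanish for identical densities.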

Another family is the one using the probabilistic dependence of the
measurement vector z on the class $\omega_k$. Suppose that the conditional
density of z is not affected by $\omega_k$, i.e. $p(\mathbf{z}|\omega_k) = p(\mathbf{z})$, $\forall \mathbf{z}$, then an
observation of z does not increase our knowledge on $\omega_k$. Therefore, the ability of
z to discriminate class $\omega_k$ from the rest can be expressed in a measure of
the probabilistic dependence:


$$\int g\big(p(\mathbf{z}|\omega_k),\, p(\mathbf{z})\big)\,d\mathbf{z} \tag{6.24}$$


where the function $g(\cdot,\cdot)$ must have similar properties as in (6.20). In
order to incorporate all classes, a weighted sum of (6.24) is formed to get
the final performance measure. As an example, the Chernoff measure
now becomes:


$$J_{C\text{-}dep}(s) = -\sum_{k=1}^{K} P(\omega_k)\,\ln\int p^{s}(\mathbf{z}|\omega_k)\,p^{1-s}(\mathbf{z})\,d\mathbf{z} \tag{6.25}$$


Other dependence measures can be derived from the probabilistic distance
measures in a similar manner.

A third family is founded on information theory and involves the
posterior probabilities $P(\omega_k|\mathbf{z})$. An example is Shannon's entropy measure.
For a given z, the information of the true class associated with z is
quantified by Shannon's entropy:


$$H(\mathbf{z}) = -\sum_{k=1}^{K} P(\omega_k|\mathbf{z})\log_2 P(\omega_k|\mathbf{z}) \tag{6.26}$$

Its expectation

$$J_{SHANNON} = E[H(\mathbf{z})] = \int H(\mathbf{z})\,p(\mathbf{z})\,d\mathbf{z} \tag{6.27}$$


is a performance measure suitable for feature selection and extraction. 
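A small sketch may help to make (6.26) and (6.27) concrete. It computes the entropy of a posterior probability vector and estimates the expectation by a sample average. The sample-average estimate and the example posteriors are assumptions made for illustration; they are not taken from the text.

```python
import math


def shannon_entropy(posteriors):
    """Entropy (6.26) of one posterior vector; terms with P = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in posteriors if p > 0.0)


def shannon_measure(posterior_rows):
    """Sample-average estimate of the expectation (6.27)."""
    return sum(shannon_entropy(row) for row in posterior_rows) / len(posterior_rows)


# Hypothetical posteriors P(w_k | z) for two samples, two classes each.
informative = [[0.9, 0.1], [0.2, 0.8]]
uninformative = [[0.5, 0.5], [0.5, 0.5]]

j_informative = shannon_measure(informative)
j_uninformative = shannon_measure(uninformative)
```

In this sketch a lower value indicates less residual uncertainty about the class, so measurements that give sharper posteriors score lower.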


6.2 FEATURE SELECTION 


This section introduces the problem of selecting a subset from the
N-dimensional measurement vector such that this subset is most suitable
for classification. Such a subset is called a feature set and its elements
are features. The problem is formalized as follows. Let
$F(N) = \{z_n \mid n = 0,\ldots,N-1\}$ be the set with elements from the measurement
vector z. Furthermore, let $F_j(D) = \{y_d \mid d = 0,\ldots,D-1\}$ be a subset of
F(N) consisting of D < N elements taken from z. For each element $y_d$
there exists an element $z_n$ such that $y_d = z_n$. The number of distinguishable
subsets for a given D is:


$$q(D) = \binom{N}{D} = \frac{N!}{(N-D)!\,D!} \tag{6.28}$$


This quantity expresses the number of different combinations that can 
be made from N elements, each combination containing D elements and 
with no two combinations containing exactly the same D elements. We 
will adopt an enumeration of the index j according to: j = 1,...,q(D). 

In Section 6.1 the concept of a performance measure has been introduced.
These performance measures evaluate the appropriateness of a
measurement vector for the classification task. Let $J(F_j(D))$ be a performance
measure related to the subset $F_j(D)$. The particular choice of
$J(\cdot)$ depends on the problem at hand. For instance, in a multi-class
problem, the interclass/intraclass distance $J_{INTER/INTRA}(\cdot)$ could be useful;
see (6.10).

The goal of feature selection is to find a subset $\hat{F}(D)$ with dimension D
such that this subset outperforms all other subsets with dimension D:


$$\hat{F}(D) = F_i(D) \quad \text{with: } J(F_i(D)) \ge J(F_j(D)) \quad \text{for all } j \in \{1,\ldots,q(D)\} \tag{6.29}$$


An exhaustive search for this particular subset would solve the problem of
feature selection. However, there is one practical problem: how to accomplish
the search process? An exhaustive search requires q(D) evaluations of
$J(F_j(D))$. Even in a simple classification problem, this number is enormous.
For instance, let us assume that N = 20 and D = 10. This gives about
$2 \times 10^5$ different subsets. A doubling of the dimension to N = 40 requires
about $10^9$ evaluations of the criterion. Needless to say, even in problems
with moderate complexity, an exhaustive search is out of the question.
Obviously, a more efficient search strategy is needed. For this, many
options exist, but most of them are suboptimal (in the sense that they
cannot guarantee that the subset with the best performance will be
found). In the remaining part of this section, we will first consider a
search strategy called 'branch-and-bound'. This strategy is one of the
few that guarantees optimality (under some assumptions). Next, we
continue with some of the much faster, but suboptimal strategies.
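The subset counts quoted above follow directly from (6.28). The short sketch below reproduces them with Python's `math.comb` (available from Python 3.8); the function name is chosen here for illustration.

```python
import math


def number_of_subsets(n_measurements, n_features):
    """q(D) of (6.28): the number of distinct subsets of size D out of N."""
    return math.comb(n_measurements, n_features)


q_20 = number_of_subsets(20, 10)   # 184756, i.e. about 2 x 10^5
q_40 = number_of_subsets(40, 10)   # 847660528, i.e. about 10^9
```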
6.2.1 Branch-and-bound


The search process is accomplished systematically by means of a tree
structure. See Figure 6.4. The tree consists of N − D + 1 levels. Each
level is enumerated by a variable n with n varying from D up to N.
A level consists of a number of nodes. A node i at level n corresponds to
a subset $F_i(n)$. At the highest level, n = N, there is only one node
corresponding to the full set F(N). At the lowest level n = D there are
q(D) nodes corresponding to the q(D) subsets among which the solution
must be found. Levels in between have a number of nodes that is less
than or equal to q(n). A node at level n is connected to one node at level
n + 1 (except the node F(N) at level N). In addition, each node at level n
is connected to one or more nodes at level n − 1 (except the nodes at
level D). In the example of Figure 6.4, N = 5 and D = 2.

A prerequisite of the branch-and-bound strategy is that the perform- 
ance measure is a monotonically increasing function of the dimension D. 
The assumption behind it is that if we remove one element from a 
measurement vector, the performance can only become worse. In prac- 
tice, this requirement is not always satisfied. If the training set is finite, 
problems in estimating large numbers of parameters may result in an 
actual performance increase when fewer measurements are used (see 
Figure 6.1). 

The search process takes place from the highest level (n = N) by
systematically traversing all levels until finally the lowest level (n = D)
is reached. Suppose that one branch of the tree has been explored up
to the lowest level, and suppose that the best performance measure
found so far at level n = D is $\hat{J}$. Consider a node $F_i(n)$ (at a level
n > D) which has not been explored as yet. Then, if $J(F_i(n)) \le \hat{J}$, it is
unnecessary to consider the nodes below $F_i(n)$. The reason is that the
performance measure for these nodes can only become less than $J(F_i(n))$
and thus less than $\hat{J}$.

[Tree diagram: the root at level n = N = 5 holds the full set z0 z1 z2 z3 z4;
each lower level (n = 4, n = 3, n = D = 2) holds the subsets obtained by
deleting one element, with the index of the deleted element on each branch.]

Figure 6.4 A top-down tree structure for feature selection

The following algorithm traverses the tree structure according to a
depth first search with a backtrack mechanism. In this algorithm the
best performance measure found so far at level n = D is stored in a
variable $\hat{J}$. Branches whose performance measure is bounded by $\hat{J}$ are
skipped. Therefore, relative to exhaustive search the algorithm is much
more computationally efficient. The variable S denotes a subset. Initially
this subset is empty, and $\hat{J}$ is set to 0. As soon as a combination of D
elements has been found that exceeds the current performance, this
combination is stored in S and the corresponding measure in $\hat{J}$.


Algorithm 6.1: Branch-and-bound search
Input: a labelled training set on which a performance measure J() is
defined.

1. Initiate: $\hat{J} = 0$ and $S = \emptyset$;
2. Explore-node($F_1(N)$);

Output: The maximum performance measure stored in $\hat{J}$ with the
associated subset of $F_1(N)$ stored in S.

Procedure: Explore-node($F_i(n)$)

1. If ($J(F_i(n)) \le \hat{J}$) then return;
2. If (n = D) then
   2.1. If ($J(F_i(n)) > \hat{J}$) then
      2.1.1. $\hat{J} = J(F_i(n))$;
      2.1.2. $S = F_i(n)$;
      2.1.3. return;
3. For all ($F_j(n-1) \subset F_i(n)$) do Explore-node($F_j(n-1)$);
4. return;


The algorithm is recursive. The procedure ‘Explore-node()’ explores the 
node given in its argument list. If the node is not a leaf, all its children 
nodes are explored by calling the procedure recursively. The first call is 
with the full set F(N) as argument. 

The algorithm listed above does not specify the exact structure of the
tree. The structure follows from the specific implementation of the loop
in step 3. This loop also controls the order in which branches are
explored. Both aspects influence the computational efficiency of the
algorithm. In Figure 6.4, the list of indices of elements that are deleted
from a node to form the child nodes follows a specific pattern (see
exercise 2). The indices of these elements are shown in the figure. Note
that this tree is not minimal because some twigs have their leaves at a
level higher than D. Of course, pruning these useless twigs in advance is
computationally more efficient.
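One way to realize Algorithm 6.1 is sketched below. The tree structure used here is an assumption, not necessarily the pattern of Figure 6.4: child nodes are formed by deleting elements in increasing index order, which visits each subset at most once, and twigs that cannot reach level D are pruned in advance. The toy criterion is hypothetical; it is monotonic because all its contributions are non-negative.

```python
from itertools import combinations


def branch_and_bound(n_measurements, n_features, criterion):
    """Depth-first branch-and-bound for a monotonic criterion.

    Returns (best value, best subset, number of criterion evaluations).
    """
    best = {"value": 0, "subset": None}
    evaluations = 0

    def explore(subset, first_removable):
        nonlocal evaluations
        value = criterion(subset)
        evaluations += 1
        if value <= best["value"]:
            return                      # bound: no descendant can do better
        to_remove = len(subset) - n_features
        if to_remove == 0:              # level n = D reached
            best["value"], best["subset"] = value, subset
            return
        # Delete one more element. Indices are deleted in increasing order,
        # and the upper limit skips twigs that can never reach level D.
        for idx in range(first_removable, n_measurements - to_remove + 1):
            explore(subset - {idx}, idx + 1)

    explore(frozenset(range(n_measurements)), 0)
    return best["value"], best["subset"], evaluations


# Hypothetical monotonic criterion: per-element weights plus a bonus when
# elements 1 and 3 occur together.
WEIGHTS = [9, 1, 5, 3, 7, 2]


def toy_criterion(subset):
    score = sum(WEIGHTS[i] for i in subset)
    if 1 in subset and 3 in subset:
        score += 13
    return score


best_value, best_subset, n_evaluations = branch_and_bound(6, 2, toy_criterion)

# Exhaustive search for comparison.
exhaustive_value = max(toy_criterion(frozenset(c)) for c in combinations(range(6), 2))
```

For this toy criterion the search returns the subset {1, 3} with value 17, the same optimum that the exhaustive search finds, while several branches are cut off by the bound.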


6.2.2 Suboptimal search 


Although the branch-and-bound algorithm can save many calculations
relative to an exhaustive search (especially for large values of q(D)), it
may still require too much computational effort.

Another possible defect of branch-and-bound is the top-down search 
order. It starts with the full set of measurements and successively deletes 
elements. However, the assumption that the performance becomes worse 
as the number of features decreases holds true only in theory; see Figure 
6.1. In practice, the finite training set may give rise to an overoptimistic 
view if the number of measurements is too large. Therefore, a bottom-up 
search order is preferable. The tree in Figure 6.5 is an example. 

Among the many suboptimal methods, we mention a few. A simple
method is sequential forward selection (SFS). The method starts at the
bottom (the root of the tree) with an empty set and works its way to
the top (a leaf) without backtracking. At each level of the tree, SFS adds
one feature to the current feature set by selecting from the remaining
available measurements the element that yields a maximal increase of
the performance. A disadvantage of the SFS is that once a feature is
included in the selected sets, it cannot be removed, even if at a higher
level in the tree, when more features are added, this feature would be less
useful.

Figure 6.5 A bottom-up tree structure for feature selection

The SFS adds one feature at a time to the current set. An improvement
is to add more than one, say l, features at a time. Suppose that at a given
stage of the selection process we have a set $F_j(n)$ consisting of n features.
Then, the next step is to expand this set with l features taken from the
remaining N − n measurements. For that we have
$(N-n)!/((N-n-l)!\,l!)$ combinations which all must be checked to see which
one is most profitable. This strategy is called the generalized sequential
forward selection (GSFS(l)).

Both the SFS and the GSFS(l) lack a backtracking mechanism. Once a
feature has been added, it cannot be undone. Therefore, a further
improvement is to add l features at a time, and after that to dispose of
some of the features from the set obtained so far. Hence, starting with a
set $F_j(n)$ of n features we select the combination of l remaining
measurements that when added to $F_j(n)$ yields the best performance. From this
expanded set, say $F_j(n+l)$, we select a subset of r features and remove
it from $F_j(n+l)$ to obtain a set, say $F_k(n+l-r)$. The subset of r features
is selected such that $F_k(n+l-r)$ has the best performance. This strategy
is called 'Plus l-take away r selection'.
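The three strategies above can be sketched with one routine, since SFS corresponds to l = 1, r = 0 and GSFS(l) to r = 0. The routine below is a minimal illustration, not the PRTools implementation. How the final step is trimmed to exactly D features is an assumption of this sketch, and the toy criterion is hypothetical.

```python
from itertools import combinations


def plus_l_take_away_r(n_measurements, n_features, criterion, l=1, r=0):
    """Greedy 'plus l-take away r' selection; returns a frozenset of indices."""
    if not 0 <= r < l:
        raise ValueError("need 0 <= r < l")
    if not 0 < n_features <= n_measurements:
        raise ValueError("need 0 < n_features <= n_measurements")
    selected = frozenset()
    while len(selected) < n_features:
        remaining = [i for i in range(n_measurements) if i not in selected]
        n_add = min(l, len(remaining))
        # Plus l: add the most profitable combination of n_add measurements.
        grown = max(
            (selected | frozenset(c) for c in combinations(remaining, n_add)),
            key=criterion,
        )
        # Take away r, or in the last step whatever is needed to end at D.
        if len(grown) - r >= n_features or len(grown) == n_measurements:
            n_remove = len(grown) - n_features
        else:
            n_remove = r
        selected = max(
            (grown - frozenset(c) for c in combinations(sorted(grown), n_remove)),
            key=criterion,
        )
    return selected


# Hypothetical criterion: per-element weights plus a bonus when elements
# 1 and 3 occur together, so the best pair is not the two best singles.
WEIGHTS = [9, 1, 5, 3, 7, 2]


def toy_criterion(subset):
    score = sum(WEIGHTS[i] for i in subset)
    if 1 in subset and 3 in subset:
        score += 13
    return score


sfs_result = plus_l_take_away_r(6, 2, toy_criterion, l=1, r=0)
gsfs_result = plus_l_take_away_r(6, 2, toy_criterion, l=2, r=0)
plta_result = plus_l_take_away_r(6, 2, toy_criterion, l=3, r=2)
```

With this criterion SFS ends at {0, 4} (value 16), because element 0 is locked in at the first step, whereas GSFS(2) and plus 3-take away 2 both find {1, 3} (value 17).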


Example 6.2 Character classification for license plate recognition 
In traffic management, automated license plate recognition is useful 
for a number of tasks, e.g. speed checking and automated toll collec- 
tion. The functions in a system for license plate recognition include 
image acquisition, vehicle detection, license plate localization, etc. 
Once the license plate has been localized in the image, it is partitioned 
such that each part of the image — a bitmap — holds exactly one 
character. The classes include all alphanumerical values. Therefore, 
the number of classes is 36. Figure 6.6(a) provides some examples of 
bitmaps obtained in this way. 

The size of the bitmaps ranges from 15 x 6 up to 30 x 15. In order 
to fix the number of pixels, and also to normalize the position, scale 
and contrasts of the character within the bitmap, all characters are 
normalized to 15 x 11. 

Using the raw image data as measurements, the number of mea- 
surements per object is 15 x 11 = 165. A labelled training set con- 
sisting of about 70000 samples, i.e. about 2000 samples/class, is 
available. Comparing the number of measurements against the num- 
ber of samples/class we conclude that overfitting is likely to occur. 
Figure 6.6 Character classification for license plate recognition. (a) Character sets
from license plates, before and after normalization. (b) Selected features. The
number of features is 18 and 50 respectively





A feature selection procedure, based on the 'plus l-take away r'
method (l = 3, r = 2) and the inter/intra distance (Section 6.1.1) gives
feature sets as depicted in Figure 6.6(b). Using a validation set
consisting of about 50000 samples, it was established that 50 features
give the minimal error rate. A number of features above 50 introduces
overfitting. The pattern of 18 selected features, as shown in
Figure 6.6(b), is one of the intermediate results that were obtained to
get the optimal set with 50 features. It indicates which part of a
bitmap is most important to recognize the character.


6.2.3 Implementation issues 


PRTools offers a large range of feature selection methods. The
evaluation criteria are implemented in the function feateval, and
are basically all inter/intra cluster criteria. Additionally, a k-nearest
neighbour classification error is defined as a criterion. This will give
a reliable estimate of the classification complexity of the reduced
data set, but can be very computationally intensive. For larger data
sets it is therefore recommended to use the simpler inter/intra cluster
measures.

PRTools also offers several search strategies, i.e. the branch-and- 
bound algorithm, plus-l-takeaway-r, forward selection and backward 
selection. Feature selection mappings can be found using the function 
featselm. The following listing is an example. 
Listing 6.1
PRTools code for performing feature selection.


% Create a labeled dataset with 8 features, of which only 2
% are useful, and apply various feature selection methods
z = gendatd(200,8,3,3);

w = featselm(z,'maha-s','forward',2);     % Forward selection
figure; clf; scatterd(z*w);
title(['forward: ' num2str(+w{2})]);

w = featselm(z,'maha-s','backward',2);    % Backward selection
figure; clf; scatterd(z*w);
title(['backward: ' num2str(+w{2})]);

w = featselm(z,'maha-s','b&b',2);         % B&B selection
figure; clf; scatterd(z*w);
title(['b&b: ' num2str(+w{2})]);











The function gendatd creates a data set in which just the first two 
measurements are informative while all other measurements only contain 
noise (there the classes completely overlap). The listing shows three pos- 
sible feature selection methods. All of them are able to retrieve the correct 
two features. The main difference is in the required computing time: 
finding two features out of eight is approached most efficiently by the 
forward selection method, while backward selection is the most inefficient. 


6.3 LINEAR FEATURE EXTRACTION 


Another approach to reduce the dimension of the measurement vector is
to use a transformed space instead of the original measurement space.
Suppose that W(.) is a transformation that maps the measurement space
$\mathbb{R}^N$ onto a reduced space $\mathbb{R}^D$, D < N. Application of the transformation
to a measurement vector yields a feature vector $\mathbf{y} \in \mathbb{R}^D$:


y = W(z) (6.30) 


Classification is based on the feature vector rather than on the measure- 
ment vector; see Figure 6.7. 

The advantage of feature extraction over feature selection is that no
information from any of the elements of the measurement vector needs
to be wasted. Furthermore, in some situations feature extraction is easier
than feature selection. A disadvantage of feature extraction is that it
requires the determination of some suitable transformation W(). If the
transform chosen is too complex, the ability to generalize from a small
data set will be poor. On the other hand, if the transform chosen is too
simple, it may constrain the decision boundaries to a form which is
inappropriate to discriminate between classes. Another disadvantage is
that all measurements will be used, even if some of them are useless. This
might be unnecessarily expensive.

[Block diagram: measurement vector z in R^N, feature extraction W(),
feature vector y in R^D, pattern classification, assigned class.]

Figure 6.7 Feature extraction

This section discusses the design of linear feature extractors. The 
transformation W() is restricted to the class of linear operations. Such 
operations can be written as a matrix-vector product: 


y= Wz (6.31) 


where W is a D x N matrix. The reason for the restriction is threefold. 
First, a linear feature extraction is computationally efficient. Second, in 
many classification problems — though not all — linear features are 
appropriate. Third, a restriction to linear operations facilitates the math- 
ematical handling of the problem. 

An illustration of the computational efficiency of linear feature extraction
is the Gaussian case. If covariance matrices are unequal, the number of
calculations is on the order of $KN^2$; see equation (2.20). Classification based
on linear features requires about $DN + KD^2$ calculations. If D is very small
compared with N, the extraction saves a large number of calculations.
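To get a feeling for these orders of magnitude, the sketch below evaluates both expressions. The sizes are borrowed from Example 6.2 (K = 36 classes, N = 165 measurements, D = 50 features) purely as an illustration; that example used feature selection rather than extraction, and the counts are rough operation estimates, not measured timings.

```python
def full_gaussian_cost(n_classes, n_measurements):
    """Order-of-magnitude count K * N^2 for classification on raw measurements."""
    return n_classes * n_measurements ** 2


def linear_feature_cost(n_classes, n_measurements, n_features):
    """Order-of-magnitude count D * N + K * D^2 after linear feature extraction."""
    return n_features * n_measurements + n_classes * n_features ** 2


K, N, D = 36, 165, 50
cost_full = full_gaussian_cost(K, N)            # 980100
cost_reduced = linear_feature_cost(K, N, D)     # 98250
saving_factor = cost_full / cost_reduced        # roughly a factor of 10
```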

The example of Gaussian densities is also well suited to illustrate the 
appropriateness of linear features. Clearly, if the covariance matrices 
are equal, then (2.25) shows that linear features are optimal. On the 
other hand, if the expectation vectors are equal and the discriminatory 
information is in the differences between the covariance matrices, 
linear feature extraction may still be appropriate. This is shown in 
the example of Figure 2.10(b) where the covariance matrices are 
eccentric, differing only in their orientations. However, in the example 
shown in Figure 2.10(a) (concentric circles) linear features seem to be 
inappropriate. In practical situations, the covariance matrices will 
often differ in both shape and orientations. Linear feature extraction 
is likely to lead to a reduction of the dimension, but this reduction may 
be less than what is feasible with nonlinear feature extraction. 




Linear feature extraction may also improve the ability to generalize. If, in
the Gaussian case with unequal covariance matrices, the number of samples
in the training set is in the same order as that of the number
of parameters, $KN^2$, overfitting is likely to occur. But linear feature
extraction, provided that D << N, helps to improve the generalization ability.

We assume the availability of a training set and a suitable performance 
measure J(). The design of a feature extraction method boils down to 
finding the matrix W that — for the given training set — optimizes the 
performance measure. 

The performance measure of a feature vector y = Wz is denoted by
J(y) or J(Wz). With this notation, the optimal feature extraction is:


$$W = \operatorname*{argmax}_{W}\{J(W\mathbf{z})\} \tag{6.32}$$


Under the condition that J(Wz) is continuously differentiable in W, the
solution of (6.32) must satisfy:


$$\frac{\partial J(W\mathbf{z})}{\partial W} = 0 \tag{6.33}$$


Finding a solution of either (6.32) or (6.33) gives us the optimal linear
feature extraction. The search can be accomplished numerically using
the training set. Alternatively, the search can also be done analytically
assuming parameterized conditional densities. Substitution of estimated
parameters (using the training set) gives the matrix W.

In the remaining part of this section, the last approach is worked out 
for two particular cases: feature extraction for two-class problems with 
Gaussian densities and feature extraction for multi-class problems based 
on the inter/intra distance measure. The former case will be based on the 
Bhattacharyya distance. 


6.3.1 Feature extraction based on the Bhattacharyya distance 
with Gaussian distributions 


In the two-class case with Gaussian conditional densities a suitable
performance measure is the Bhattacharyya distance. Equation (6.19),
evaluated at s = 0.5, gives the Bhattacharyya distance $J_{BHAT}$ as a function
of the parameters of the Gaussian densities of the measurement vector z. These
parameters are the conditional expectations $\mu_k$ and covariance matrices
$C_k$. Substitution of y = Wz gives the expectation vectors and covariance
matrices of the feature vector, i.e. $W\mu_k$ and $WC_kW^T$, respectively. For
the sake of brevity, let m be the difference between expectations of z:


$$\mathbf{m} = \mu_1 - \mu_2 \tag{6.34}$$


Then, substitution of m, $W\mu_k$ and $WC_kW^T$ in (6.19) gives the
Bhattacharyya distance of the feature vector:


$$J_{BHAT}(W\mathbf{z}) = \frac{1}{4}(W\mathbf{m})^T\left[WC_1W^T + WC_2W^T\right]^{-1}W\mathbf{m} + \frac{1}{2}\ln\left[\frac{|WC_1W^T + WC_2W^T|}{2^D\sqrt{|WC_1W^T|\,|WC_2W^T|}}\right] \tag{6.35}$$








The first term corresponds to the discriminatory information of the 
expectation vectors; the second term to the discriminatory information 
of covariance matrices. 

Equation (6.35) is in such a form that an analytic solution of (6.32) is 
not feasible. However, if one of the two terms in (6.35) is dominant, a 
solution close to the optimal one is attainable. We consider the two 
extreme situations first. 


Equal covariance matrices


In the case where the conditional covariance matrices are equal, i.e.
$C = C_1 = C_2$, we have already seen that classification based on the
Mahalanobis distance (see (2.41)) is optimal. In fact, this classification
uses a 1 × N dimensional feature extraction matrix given by:


$$W = \mathbf{m}^T C^{-1} \tag{6.36}$$


To prove that this equation also maximizes the Bhattacharyya distance is
left as an exercise for the reader.


Equal expectation vectors


If the expectation vectors are equal, m = 0, the first term in (6.35)
vanishes and the second term simplifies to:


$$J_{BHAT}(W\mathbf{z}) = \frac{1}{2}\ln\left[\frac{|WC_1W^T + WC_2W^T|}{2^D\sqrt{|WC_1W^T|\,|WC_2W^T|}}\right] \tag{6.37}$$


The D × N matrix W that maximizes the Bhattacharyya distance can be
derived as follows.

The first step is to apply a whitening operation (Appendix C.3.1) on z
with respect to class $\omega_1$. This is accomplished by the linear operation
$\Lambda^{-1/2}V^T\mathbf{z}$. The matrices V and $\Lambda$ follow from factorization of the
covariance matrix: $C_1 = V\Lambda V^T$. V is an orthogonal matrix consisting of the
eigenvectors of $C_1$. $\Lambda$ is the diagonal matrix containing the corresponding
eigenvalues. The process is illustrated in Figure 6.8. The figure shows
a two-dimensional measurement space with samples from two classes.
The covariance matrices of both classes are depicted as ellipses.
Figure 6.8(b) shows the result of the operation $\Lambda^{-1/2}V^T$. The operation
$V^T$ corresponds to a rotation of the coordinate system such that the
ellipse of class $\omega_1$ lines up with the axes. The operation $\Lambda^{-1/2}$ corresponds
to a scaling of the axes such that the ellipse of $\omega_1$ degenerates into
a circle. The figure also shows the resulting covariance matrix belonging
to class $\omega_2$.

Figure 6.8 Linear feature extraction with equal expectation vectors. (a) Covariance
matrices with decision function. (b) Whitening of $\omega_1$ samples. (c) Decorrelation of $\omega_2$
samples. (d) Decision function based on one linear feature

The result of the operation $\Lambda^{-1/2}V^T$ on z is that the covariance matrix
associated with $\omega_1$ becomes I and the covariance matrix associated with
$\omega_2$ becomes $\Lambda^{-1/2}V^TC_2V\Lambda^{-1/2}$. The Bhattacharyya distance in the
transformed domain is:


$$J_{BHAT}(\Lambda^{-1/2}V^T\mathbf{z}) = \frac{1}{2}\ln\left[\frac{|I + \Lambda^{-1/2}V^TC_2V\Lambda^{-1/2}|}{2^N\sqrt{|\Lambda^{-1/2}V^TC_2V\Lambda^{-1/2}|}}\right] \tag{6.38}$$


The second step consists of decorrelation with respect to $\omega_2$. Suppose
that U and $\Gamma$ are matrices containing the eigenvectors and eigenvalues of
the covariance matrix $\Lambda^{-1/2}V^TC_2V\Lambda^{-1/2}$. Then, the operation
$U^T\Lambda^{-1/2}V^T$ decorrelates the covariance matrix with respect to class $\omega_2$.
The covariance matrices belonging to the classes $\omega_1$ and $\omega_2$ transform
into $U^TIU = I$ and $\Gamma$, respectively. Figure 6.8(c) illustrates the
decorrelation. Note that the covariance matrix of $\omega_1$ (being white) is not affected
by the orthonormal operation $U^T$.

The matrix $\Gamma$ is a diagonal matrix. The diagonal elements are denoted
$\gamma_i = \Gamma_{i,i}$. In the transformed domain $U^T\Lambda^{-1/2}V^T\mathbf{z}$, the Bhattacharyya
distance is:


$$J_{BHAT}(U^T\Lambda^{-1/2}V^T\mathbf{z}) = \frac{1}{2}\ln\left[\frac{|I + \Gamma|}{2^N\sqrt{|\Gamma|}}\right] = \sum_{i=0}^{N-1}\frac{1}{2}\ln\left[\frac{1}{2}\left(\sqrt{\gamma_i} + \frac{1}{\sqrt{\gamma_i}}\right)\right] \tag{6.39}$$


The expression shows that in the transformed domain the contribution
to the Bhattacharyya distance of any element is independent. The
contribution of the i-th element is:


$$\frac{1}{2}\ln\left[\frac{1}{2}\left(\sqrt{\gamma_i} + \frac{1}{\sqrt{\gamma_i}}\right)\right] \tag{6.40}$$
Therefore, if in the transformed domain D elements have to be selected,
the selection with maximum Bhattacharyya distance is found as the set
with the largest contributions. If the elements are sorted according to:


$$\sqrt{\gamma_0} + \frac{1}{\sqrt{\gamma_0}} \ge \sqrt{\gamma_1} + \frac{1}{\sqrt{\gamma_1}} \ge \cdots \ge \sqrt{\gamma_{N-1}} + \frac{1}{\sqrt{\gamma_{N-1}}} \tag{6.41}$$


then the first D elements are the ones with optimal Bhattacharyya
distance. Let $U_D$ be an N × D submatrix of U containing the D corresponding
eigenvectors of $\Lambda^{-1/2}V^TC_2V\Lambda^{-1/2}$. The optimal linear feature
extractor is:


$$W = U_D^T\Lambda^{-1/2}V^T \tag{6.42}$$


and the corresponding Bhattacharyya distance is:


$$J_{BHAT}(W\mathbf{z}) = \sum_{i=0}^{D-1}\frac{1}{2}\ln\left[\frac{1}{2}\left(\sqrt{\gamma_i} + \frac{1}{\sqrt{\gamma_i}}\right)\right] \tag{6.43}$$


Figure 6.8(d) shows the decision function following from linear feature 
extraction backprojected in the two-dimensional measurement space. 
Here, the linear feature extraction reduces the measurement space to a 
one-dimensional feature space. Application of Bayes classification in this 
space is equivalent to a decision function in the measurement space defined 
by two linear, parallel decision boundaries. In fact, the feature extraction is 
a projection onto a line orthogonal to these decision boundaries. 


The general Gaussian case 


If both the expectation vectors and the covariance matrices depend on
the classes, an analytic solution of the optimal linear extraction problem
is not easy. A suboptimal method, i.e. a method that hopefully yields
a reasonable solution without the guarantee that the optimal solution
will be found, is the one that seeks features in the subspace defined by
the differences in covariance matrices. For that, we use the same
simultaneous decorrelation technique as in the previous section. The
Bhattacharyya distance in the transformed domain is:


$$J_{BHAT}(U^T\Lambda^{-1/2}V^T\mathbf{z}) = \frac{1}{4}\sum_{i=0}^{N-1}\frac{d_i^2}{1+\gamma_i} + \frac{1}{2}\sum_{i=0}^{N-1}\ln\left[\frac{1}{2}\left(\sqrt{\gamma_i} + \frac{1}{\sqrt{\gamma_i}}\right)\right] \tag{6.44}$$


where $d_i$ are the elements of the transformed difference of expectation,
i.e. $\mathbf{d} = U^T\Lambda^{-1/2}V^T\mathbf{m}$.

Equation (6.44) shows that in the transformed space the optimal
features are the ones with the largest contributions, i.e. with the largest


$$\frac{1}{4}\frac{d_i^2}{1+\gamma_i} + \frac{1}{2}\ln\left[\frac{1}{2}\left(\sqrt{\gamma_i} + \frac{1}{\sqrt{\gamma_i}}\right)\right] \tag{6.45}$$


The extraction method is useful especially when most class information 
is contained in the covariance matrices. If this is not the case, then the 
results must be considered cautiously. Features that are appropriate for 
differences in covariance matrices are not necessarily also appropriate 
for differences in expectation vectors. 
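The ranking rule of (6.45), which reduces to (6.40) when all $d_i$ are zero, can be sketched as follows. The sketch assumes that the eigenvalues $\gamma_i$ and the transformed differences $d_i$ have already been obtained by the simultaneous decorrelation described above; the function names and the example values are hypothetical.

```python
import math


def bhattacharyya_contribution(gamma, d=0.0):
    """Contribution (6.45) of one element; with d = 0 this is (6.40)."""
    mean_term = 0.25 * d * d / (1.0 + gamma)
    cov_term = 0.5 * math.log(0.5 * (math.sqrt(gamma) + 1.0 / math.sqrt(gamma)))
    return mean_term + cov_term


def select_elements(gammas, n_features, ds=None):
    """Indices of the n_features largest contributions and their summed distance."""
    if ds is None:
        ds = [0.0] * len(gammas)
    scores = [bhattacharyya_contribution(g, d) for g, d in zip(gammas, ds)]
    order = sorted(range(len(gammas)), key=lambda i: scores[i], reverse=True)
    chosen = order[:n_features]
    return chosen, sum(scores[i] for i in chosen)


# Hypothetical eigenvalues of the transformed covariance matrix of class 2.
gammas = [1.0, 4.0, 0.1, 2.0]

# Equal expectation vectors: only the covariance term counts.
chosen_cov, j_cov = select_elements(gammas, 2)

# A difference in expectation along element 0 changes the ranking.
chosen_mean, j_mean = select_elements(gammas, 2, ds=[3.0, 0.0, 0.0, 0.0])
```

An eigenvalue of 1 contributes nothing to the covariance term, since both classes then have the same variance along that element; eigenvalues far from 1 in either direction contribute most.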





Listing 6.2
PRTools code for calculating a Bhattacharyya distance feature extractor.


z = gendatl([200 200],0.2);     % Generate a dataset
J = bhatm(z,0);                 % Calculate criterion values
figure; clf; plot(J,'r.-');     % and plot them
w = bhatm(z,1);                 % Extract one feature
figure; clf; scatterd(z);       % Plot original data
figure; clf; scatterd(z*w);     % Plot mapped data


6.3.2 Feature extraction based on inter/intra class distance 


The inter/intra class distance, as discussed in Section 6.1.1, is another
performance measure that may yield suitable feature extractors. The
starting point is the performance measure given in the space defined by
$\mathbf{y} = \Lambda^{-1/2}V^T\mathbf{z}$. Here, $\Lambda$ is a diagonal matrix containing the eigenvalues of
$S_w$, and V a unitary matrix containing the corresponding eigenvectors. In
the transformed domain the performance measure is expressed as (6.10):


$$J_{INTER/INTRA} = \mathrm{trace}\left(\Lambda^{-1/2}V^TS_bV\Lambda^{-1/2}\right) \tag{6.46}$$


A further simplification occurs when a second unitary transform is
applied. The purpose of this transform is to decorrelate the
between-scatter matrix. Suppose that $\Gamma$ is a diagonal matrix whose diagonal
elements $\gamma_i = \Gamma_{i,i}$ are the eigenvalues of the transformed between-scatter
matrix $\Lambda^{-1/2}V^TS_bV\Lambda^{-1/2}$. Let U be a unitary matrix containing the
eigenvectors corresponding to $\Gamma$. Then, in the transformed domain
defined by:


$$\mathbf{y} = U^T\Lambda^{-1/2}V^T\mathbf{z} \tag{6.47}$$


the performance measure becomes:


$$J_{INTER/INTRA} = \mathrm{trace}(\Gamma) = \sum_{i=0}^{N-1}\gamma_i \tag{6.48}$$


The operation $U^T$ corresponds to a rotation of the coordinate system
such that the between-scatter matrix lines up with the axes. Figure 6.9
illustrates this.

The merit of (6.48) is that the contributions of the elements add up
independently. Therefore, in the space defined by $\mathbf{y} = U^T\Lambda^{-1/2}V^T\mathbf{z}$ it is
easy to select the best combination of D elements. It suffices to determine
the D elements from y whose eigenvalues $\gamma_i$ are largest. Suppose that the
eigenvalues are sorted according to $\gamma_i \ge \gamma_{i+1}$, and that the eigenvectors
corresponding to the D largest eigenvalues are collected in $U_D$, being an
N × D submatrix of U. Then, the linear feature extraction becomes:


$$W = U_D^T\Lambda^{-1/2}V^T \tag{6.49}$$


Figure 6.9 Feature extraction based on the interclass/intraclass distance (see
Figure 6.2). (a) The within and between scatters after simultaneous decorrelation.
(b) Linear feature extraction


The feature space defined by $y = Wz$ can be thought of as a linear
subspace of the measurement space. This subspace is spanned by the D
row vectors in W. The performance measure associated with this feature
space is:

$$J_{INTER/INTRA}(Wz) = \sum_{i=0}^{D-1} \gamma_i \qquad (6.50)$$
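The construction leading to (6.49) and (6.50) can be sketched in plain MATLAB, without PRTools. This is only an illustration under stated assumptions: the three-class toy data set and all variable names (Z, lab, Sw, Sb, Lmh, UD) are hypothetical, and the scatter matrices are assumed to be normalized by the total number of samples, as in Section 6.1.1.

```matlab
% Hypothetical toy data: K = 3 classes, N = 3 measurements, rows are objects
Z = [0 0 0; 1 0 1; 0 1 0; 1 1 1; ...
     4 0 0; 5 0 1; 4 1 1; 5 1 0; ...
     0 5 1; 1 5 0; 0 6 0; 1 6 1];
lab = [1 1 1 1 2 2 2 2 3 3 3 3]';
[NS, N] = size(Z);  K = max(lab);  D = 2;

mu = mean(Z, 1);                          % overall sample mean
Sw = zeros(N);  Sb = zeros(N);
for k = 1:K
    Zk  = Z(lab == k, :);
    muk = mean(Zk, 1);
    Zc  = Zk - muk;                       % deviations from the class mean
    d   = muk - mu;                       % class mean relative to overall mean
    Sw  = Sw + (Zc' * Zc) / NS;           % within-scatter matrix
    Sb  = Sb + (size(Zk, 1) / NS) * (d' * d);   % between-scatter matrix
end

[V, L] = eig(Sw);                         % Sw = V*L*V'
Lmh    = diag(1 ./ sqrt(diag(L)));        % Lambda^(-1/2)
Sbt    = Lmh * V' * Sb * V * Lmh;         % transformed between-scatter matrix
Sbt    = (Sbt + Sbt') / 2;                % remove round-off asymmetry
[U, G] = eig(Sbt);
[gam, idx] = sort(diag(G), 'descend');    % gamma_0 >= gamma_1 >= ...
UD = U(:, idx(1:D));                      % N x D submatrix of U
W  = UD' * Lmh * V';                      % feature extractor, cf. (6.49)
J  = sum(gam(1:D));                       % performance measure, cf. (6.50)
```

In the feature space the within-scatter becomes the identity matrix and the between-scatter becomes diagonal, which is why the contributions of the selected elements simply add up.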


Example 6.3 Feature extraction based on inter/intra distance 
Figure 6.9(a) shows the within-scattering and between-scattering of 
Example 6.1 after simultaneous decorrelation. The within-scattering 
has been whitened. After that, the between-scattering is rotated such 
that its ellipse is aligned with the axes. In this figure, it is easy to see 
which axis is the most important. The eigenvalues of the between-scatter
matrix are $\gamma_0 = 56.3$ and $\gamma_1 = 2.8$, respectively. Hence,
omitting the second feature does not deteriorate the performance much.

The feature extraction itself can be regarded as an orthogonal 
projection of samples on this subspace. Therefore, decision bound- 
aries defined in the feature space correspond to hyperplanes orthog- 
onal to the linear subspace, i.e. planes satisfying equations of the type 
Wz = constant. 


A characteristic of linear feature extraction based on $J_{INTER/INTRA}$ is
that the dimension of the feature space found will not exceed K − 1, where
K is the number of classes. This follows from expression (6.7), which
shows that $S_b$ is the sum of K outer products of vectors (of which one
vector linearly depends on the others). Therefore, the rank of $S_b$ cannot
exceed K − 1. Consequently, the number of nonzero eigenvalues of $S_b$
cannot exceed K − 1 either. Another way to put this into words is that
the K conditional means $\hat{\mu}_k$ span a (K − 1) dimensional linear
subspace in $\mathbb{R}^N$. Since the basic assumption of the inter/intra
distance is that
within-scattering does not convey any class information, any feature 
extractor based on that distance can only find class information within 
that subspace. 
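The rank argument can be checked numerically. The following minimal sketch assumes hypothetical class means and class sizes for K = 3 classes in a five-dimensional measurement space; only the standard MATLAB function rank is used.

```matlab
% Hypothetical class means of K = 3 classes in N = 5 dimensions (one per row)
M  = [1 0 2 0 1; ...
      0 3 1 1 0; ...
      2 1 0 4 1];
Nk = [10 20 30];                 % assumed number of training samples per class
NS = sum(Nk);
mu = Nk * M / NS;                % overall mean: weighted mean of the class means

Sb = zeros(5);
for k = 1:3
    d  = M(k, :) - mu;           % deviation of the class mean from the overall mean
    Sb = Sb + (Nk(k) / NS) * (d' * d);   % one outer product per class
end
r = rank(Sb);                    % cannot exceed K - 1 = 2
```

The weighted deviations of the class means sum to zero, so one of the K outer products depends on the others and the rank is at most K − 1, whatever the dimension N.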
Example 6.4 License plate recognition (continued)
In the license plate application, discussed in Example 6.2, the 
measurement space (consisting of 15 x 11 bitmaps) is too large with 
respect to the size of the training set. Linear feature extraction based 
on maximization of the inter/intra distance reduces this space to at
most $D_{max} = K - 1 = 35$ features. Figure 6.10(a) shows how the
inter/intra distance depends on D. It can be seen that at about 
D = 24 the distance has almost reached its maximum. Therefore, a 
reduction to 24 features is possible without losing much information. 
Figure 6.10(b) is a graphical representation of the transformation 
matrix W. The matrix is 24 x 165. Each row of the matrix serves as a 
vector on which the measurement vector is projected. Therefore, each 
row can be depicted as a 15 x 11 image. The figure is obtained by 
means of MATLAB code that is similar to Listing 6.3. 


Listing 6.3 

PRTools code for creating a linear feature extractor based on maximiza- 
tion of the inter/intra distance. The function for calculating the mapping 
is fisherm. The result is an affine mapping, i.e. a mapping of the 
type Wz + b. The additive term b shifts the overall mean of the features 
to the origin. In this example, the measurement vectors come directly 
from bitmaps. Therefore, the mapping can be visualized by images. The 
listing also shows how fisherm can be used to get a cumulative plot of
$J_{INTER/INTRA}$, as depicted in Figure 6.10(a). The precise call to fisherm
is discussed in more detail in Exercise 5.





load license_plates.mat             % Load dataset
figure; clf; show(z);               % Display it
J=fisherm(z,0);                     % Calculate criterion values
figure; clf; plot(J,'r.-');         % and plot them
w=fisherm(z,24,0.9);                % Calculate the feature extractor
figure; clf; show(w);               % Show the mappings as images


Figure 6.10 Feature extraction in the license plate application. (a) The inter/intra
distance as a function of D. (b) First 24 eigenvectors in W depicted as 15 × 11 pixel
images


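The two-class case, Fisher’s linear discriminant

In the two-class case, the between-scatter matrix of expression (6.7)
reduces to:

$$S_b = \frac{1}{N_S}\left(N_1(\hat{\mu}_1 - \hat{\mu})(\hat{\mu}_1 - \hat{\mu})^T + N_2(\hat{\mu}_2 - \hat{\mu})(\hat{\mu}_2 - \hat{\mu})^T\right) = \alpha(\hat{\mu}_1 - \hat{\mu}_2)(\hat{\mu}_1 - \hat{\mu}_2)^T \qquad (6.51)$$

where $\alpha$ is a constant that depends on $N_1$ and $N_2$. In the
transformed space, $S_b$ becomes
$\alpha\Lambda^{-1/2}V^T(\hat{\mu}_1 - \hat{\mu}_2)(\hat{\mu}_1 - \hat{\mu}_2)^T V\Lambda^{-1/2}$.
This matrix has only one eigenvector with a nonzero eigenvalue
$\gamma_0 = \alpha(\hat{\mu}_1 - \hat{\mu}_2)^T S_w^{-1}(\hat{\mu}_1 - \hat{\mu}_2)$.
The corresponding eigenvector is
$u = \Lambda^{-1/2}V^T(\hat{\mu}_1 - \hat{\mu}_2)$. With that, the
feature extractor evolves into (see (6.49)):

$$W = u^T\Lambda^{-1/2}V^T = \left(\Lambda^{-1/2}V^T(\hat{\mu}_1 - \hat{\mu}_2)\right)^T\Lambda^{-1/2}V^T = (\hat{\mu}_1 - \hat{\mu}_2)^T S_w^{-1} \qquad (6.52)$$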


This solution, known as Fisher’s linear discriminant (Fisher, 1936), is
similar to the Bayes classifier for Gaussian random vectors with
class-independent covariance matrices, i.e. the Mahalanobis distance
classifier. The difference with expression (2.41) is that the true covariance
matrix and the expectation vectors have been replaced with estimates 
from the training set. The multi-class solution (6.49) can be regarded as 
a generalization of Fisher’s linear discriminant. 
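For the two-class case, the result (6.52) can be evaluated directly. The sketch below is a plain MATLAB illustration on a hypothetical two-dimensional training set; the names Z1, Z2, y1 and y2 are assumptions for this example, and the within-scatter matrix is assumed to be normalized by the total number of samples.

```matlab
% Hypothetical two-class training set (rows are objects, N = 2)
Z1 = [1 2; 2 3; 3 3; 2 1; 2 2];
Z2 = [5 5; 6 7; 7 6; 6 5; 6 6];
NS = size(Z1, 1) + size(Z2, 1);

mu1 = mean(Z1, 1);
mu2 = mean(Z2, 1);
Sw  = ((Z1 - mu1)' * (Z1 - mu1) + (Z2 - mu2)' * (Z2 - mu2)) / NS;  % within-scatter

W  = (mu1 - mu2) / Sw;     % row vector (mu1 - mu2)' * inv(Sw), cf. (6.52)
y1 = Z1 * W';              % one-dimensional features of class 1
y2 = Z2 * W';              % one-dimensional features of class 2
```

For this toy set the two classes do not overlap on the single extracted feature, so a threshold on y suffices to separate them.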
6.5 EXERCISES


1. Prove equation (6.4). (*)

Hint: use $z_{k,n} - z_{l,m} = (z_{k,n} - \hat{\mu}_k) + (\hat{\mu}_k - \hat{\mu}) + (\hat{\mu} - \hat{\mu}_l) + (\hat{\mu}_l - z_{l,m})$.

2. Develop an algorithm that creates a tree structure like in Figure 6.4. Can you adapt 
that algorithm such that the tree becomes minimal (thus, without the superfluous 
twigs)? (0) 

3. Under what circumstances would it be advisable to use forward selection, or
plus-l-takeaway-r selection with l > r? And backward selection, or
plus-l-takeaway-r selection with l < r? (0)

4. Prove that $W = m^T C^{-1}$ is the feature extractor that maximizes the Bhattacharyya
distance in the two-class Gaussian case with equal covariance matrices. (**)








5. In Listing 6.3, fisherm is called with 0.9 as its third argument. Why do you think this
is used? Try the same routine, but leave out the third argument (i.e. use
w=fisherm(z,24)). Can you explain what you see now? (*)

6. Find an alternative method of preventing the singularities you saw in Exercise 6.5. Will 
the results be the same as those found using the original Listing 6.3? (**) 

7. What is the danger of optimizing the parameters of the feature extraction or selection 
stage, such as the number of features to retain, on the training set? How could you 
circumvent this? (0) 
Unsupervised Learning


In the previous chapter we discussed methods for reducing the dimen- 
sion of the measurement space in order to decrease the cost of classifica- 
tion and to improve the ability to generalize. In these procedures it was 
assumed that for all training objects, class labels were available. In many 
practical applications, however, the training objects are not labelled, or 
only a small fraction of them are labelled. In these cases it can be
worthwhile to let the data speak for itself. The structure in the data will have to
be discovered without the help of additional labels. 

An example is colour-based pixel classification. In video-based surveil- 
lance and safety applications, for instance, one of the tasks is to track the 
foreground pixels. Foreground pixels are pixels belonging to the objects of 
interest, e.g. cars on a parking place. The RGB representation of a pixel can 
be used to decide whether a pixel belongs to the foreground or not. How- 
ever, the colours of neither the foreground nor the background are known 
in advance. Unsupervised training methods can help to decide which pixels 
of the image belong to the background, and which to the foreground. 

Another example is an insurance company which might want to know 
if typical groups of customers exist, such that it can offer suitable 
insurance packages to each of these groups. The information provided 
by an insurance expert may introduce a significant bias. Unsupervised 
methods can then help to discover additional structures in the data. 

In unsupervised methods, we wish to transform and reduce the data 
such that a specific characteristic in the data is highlighted. In this 
chapter we will discuss two main characteristics which can be explored:
the subspace structure of data and its clustering characteristics. The first 
tries to summarize the objects using a smaller number of features than 
the original number of measurements; the second tries to summarize the 
data set using a smaller number of objects than the original number. 
Subspace structure is often interesting for visualization purposes. The 
human visual system is highly capable of finding and interpreting struc- 
ture in 2D and 3D graphical representations of data. When higher 
dimensional data is available, a transformation to 2D or 3D might 
facilitate its interpretation by humans. Clustering serves a similar pur- 
pose, interpretation, but also data reduction. When very large amounts 
of data are available, it is often more efficient to work with cluster 
representatives instead of the whole data set. In Section 7.1 we will treat 
feature reduction, in Section 7.2 we discuss clustering. 


7.1 FEATURE REDUCTION 


The most popular unsupervised feature reduction method is principal 
component analysis (Jolliffe, 1986). This will be discussed in Section 
7.1.1. One of the drawbacks of this method is that it is a linear method, 
so nonlinear structures in the data cannot be modelled. In Section 7.1.2 
multi-dimensional scaling is introduced, which is a nonlinear feature 
reduction method. 


7.1.1 Principal component analysis 


The purpose of principal component analysis (PCA) is to transform a 
high dimensional measurement vector z to a much lower dimensional 
feature vector y by means of an operation: 


$$y = W_D(z - \bar{z}) \qquad (7.1)$$


such that z can be reconstructed accurately from y.

$\bar{z}$ is the expectation of the random vector z. It is constant for all
realizations of z. Without loss of generality, we can assume that
$\bar{z} = 0$ because we can always introduce a new measurement vector,
$\tilde{z} = z - \bar{z}$, and apply the analysis to this vector. Hence,
under that assumption, we have:


$$y = W_D z \qquad (7.2)$$
The D × N matrix $W_D$ transforms the N-dimensional measurement
space to a D-dimensional feature space. Ideally, the transform is such
that y is a good representation of z despite the lower dimension of y.
This objective is pursued by selecting $W_D$ such that an (unbiased)
linear MMSE estimate¹ $\hat{z}_{lMMSE}$ for z based on y yields a minimum
mean square error (see Section 3.1.5):


$$W_D = \arg\min_{W} \left\{ E\left[ \|\hat{z}_{lMMSE}(y) - z\|^2 \right] \right\} \quad \text{with} \quad y = Wz \qquad (7.3)$$


It is easy to see that this objective function does not provide a unique
solution for $W_D$. If a minimum is reached for some $W_D$, then any matrix
$AW_D$ is another solution with the same minimum (provided that A is
invertible) as the transformation A will be inverted by the linear MMSE
procedure. For uniqueness, we add two requirements. First, we require that
the information carried in the individual elements of y adds up individually.
By that we mean that if y is the optimal D dimensional representation of
z, then the optimal D − 1 dimensional representation is obtained from y,
simply by deleting its least informative element. Usually, the elements of y
are sorted in decreasing order of importance, so that the least informative
element is always the last element. With this convention, the matrix
$W_{D-1}$ is obtained from $W_D$ simply by deleting the last row of $W_D$.

The requirement leads to the conclusion that the elements of y must be 
uncorrelated. If not, then the least informative element would still carry 
predictive information about the other elements of y which conflicts 
with our requirement. Hence, the covariance matrix $C_y$ of y must be a
diagonal matrix, say $\Lambda$. If $C_z$ is the covariance matrix of z, then:


$$C_y = W_D C_z W_D^T = \Lambda_D \qquad (7.4)$$


For D = N it follows that $C_z W_N^T = W_N^T \Lambda_N$ because $W_N$ is an
invertible matrix (in fact, an orthogonal matrix) and $W_N^T W_N$ must be a
diagonal matrix (see Appendix B.5 and C.3.2). As $\Lambda_N$ is a diagonal
matrix, the columns of $W_N^T$ must be eigenvectors of $C_z$. The diagonal
elements of $\Lambda_N$ are the corresponding eigenvalues.

The solution is still not unique because each element of y can be scaled
individually without changing the minimum. Therefore, the second
requirement is that each column of $W_N^T$ has unit length. Since the
eigenvectors are orthogonal, this requirement is fulfilled by
$W_N W_N^T = I$ with I the N × N unit matrix. With that, $W_N$ establishes a
rotation on z. The rows of the matrix $W_N$, i.e. the eigenvectors, must be
sorted such that the eigenvalues form a non-ascending sequence. For
arbitrary D, the matrix $W_D$ is constructed from $W_N$ by deleting the last
N − D rows.

¹ Since z and y are zero mean, the unbiased linear MMSE estimator coincides
with the linear MMSE estimator.

The interpretation of this is as follows (see Figure 7.1). The operator
$W_N$ performs a rotation on z such that its orthonormal basis aligns with
the principal axes of the ellipsoid associated with the covariance matrix 
of z. The coefficients of this new representation of z are called the 
principal components. The axes of the ellipsoid point in the principal 
directions. The MMSE approximation of z using only D coefficients 
is obtained by nullifying the principal components with least variances. 
Hence, if the principal components are ordered according to their 
variances, the elements of y are formed by the first D principal compon- 
ents. The linear MMSE estimate is: 


$$\hat{z}_{lMMSE}(y) = W_D^T y = W_D^T W_D z$$
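The steps above can be sketched in plain MATLAB using only eig. The small data set is hypothetical, the covariance matrix is replaced by the sample covariance, and the names WN, WD, Y and Zhat are chosen for this illustration only.

```matlab
% Hypothetical data: 6 objects with N = 3 measurements (rows are objects)
Z  = [2 0 1; 4 1 0; 6 1 2; 8 3 1; 10 3 3; 12 4 2];
D  = 2;
n  = size(Z, 1);
zm = mean(Z, 1);
Zc = Z - zm;                              % subtract the sample mean
Cz = (Zc' * Zc) / (n - 1);                % sample covariance matrix
Cz = (Cz + Cz') / 2;                      % guard against round-off asymmetry

[E, L]     = eig(Cz);                     % columns of E are eigenvectors of Cz
[lam, idx] = sort(diag(L), 'descend');    % non-ascending eigenvalues
WN = E(:, idx)';                          % rows of WN are the sorted eigenvectors
WD = WN(1:D, :);                          % delete the last N - D rows

Y    = Zc * WD';                          % first D principal components, per row
Zhat = Y * WD + zm;                       % reconstruction WD'*y, mean added back
mse  = sum(sum((Zhat - Z).^2)) / (n - 1); % reconstruction error on this sample
```

On the sample itself, the reconstruction error equals the sum of the discarded eigenvalues, which is why the components with the smallest variances are the ones to drop.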


PCA can be used as a first step to reduce the dimension of the measure- 
ment space. In practice, the covariance matrix is often replaced by the 
sample covariance estimated from a training set. See Section 5.2.3. 
Unfortunately, PCA can be counter-productive for classification and 
estimation problems. The PCA criterion selects a subspace of the feature 
space, such that the variance of z is conserved as much as possible. 
However, this is done regardless of the classes. A subspace with large 
variance is not necessarily one in which classes are well separated. 
Figure 7.1 Principal component analysis

A second drawback of PCA is that the results are not invariant to the
particular choice of the physical units of the measurements. Each element
of z is individually scaled according to the unit in which it is
expressed. Changing a unit from, for instance, m (meter) to μm (micrometer)
may result in dramatic changes of the principal directions.
Usually this phenomenon is circumvented by scaling the vector z such 
that the numerical values of the elements all have unit variance. 
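The dependence on units, and the effect of scaling to unit variance, can be illustrated with a small sketch. The two-column data set (a length and a mass) is hypothetical; cov, std and eig are standard MATLAB functions.

```matlab
% Hypothetical measurements: column 1 a length in metres, column 2 a mass in kg
Z = [0.001 10; 0.002 14; 0.003 11; 0.004 17; 0.005 13];

Zum       = Z;
Zum(:, 1) = 1e6 * Z(:, 1);        % the same lengths expressed in micrometres

% First principal direction for both choices of unit
[E1, L1] = eig(cov(Z));    [~, i1] = max(diag(L1));  p_m  = E1(:, i1);
[E2, L2] = eig(cov(Zum));  [~, i2] = max(diag(L2));  p_um = E2(:, i2);

% After scaling every element to unit variance the unit no longer matters
S1 = Z   ./ std(Z);
S2 = Zum ./ std(Zum);
```

In metres the first principal direction is almost entirely the mass axis, in micrometres almost entirely the length axis; after scaling, both versions have the same covariance matrix.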


Example 7.1 Image compression 

Since PCA aims at the reduction of a measurement vector in such a way
that it can be reconstructed accurately, the technique is suitable for
image compression. Figure 7.2 shows the original image. The image
plane is partitioned into 32 × 32 regions each having 8 × 8 pixels.
The grey levels of the pixels are stacked in 64 dimensional vectors z,
from which the sample mean and the sample covariance matrix have
been estimated. The figure also shows the fraction of the cumulative
eigenvalues, i.e. $\sum_{i=0}^{D-1} \gamma_i / \operatorname{trace}(\Gamma)$
where $\gamma_i$ is the i-th diagonal element of $\Gamma$. The first eight
eigenvectors are depicted as 8 × 8 images. The reconstruction based on
these eight eigenvectors is also shown. The compression is about 96%.
PRTools code for this PCA compression algorithm is given in Listing 7.1.

Figure 7.2 Application of PCA to image compression


Listing 7.1 
PRTools code for finding a set of PCA basis vectors for image compres- 
sion and producing output similar to Figure 7.2. 


im=double(imread('car.tif'));            % Load image
figure; clf; imshow(im,[0 255]);         % Display image
x=im2col(im,[8 8],'distinct');           % Extract 8x8 windows
z=dataset(x');                           % Create dataset
z.featsize=[8 8];                        % Indicate window size

% Plot fraction of cumulative eigenvalues
v=pca(z,0); figure; clf; plot(v);

% Find 8D PCA mapping and show basis vectors
w=pca(z,8); figure; clf; show(w)

% Reconstruct image and display it
z_hat=z*w*w';
im_hat=col2im(+z_hat',[8 8],size(im),'distinct');
figure; clf; imshow(im_hat,[0 255]);


The original image blocks are converted into vectors, a data set is created 
and a PCA base is found. This base is then shown and used to
reconstruct the image. Note that this listing also uses the MATLAB Image
Processing toolbox, specifically the functions im2col and col2im.
The function pca is used for the actual analysis. 
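The fraction of cumulative eigenvalues mentioned in Example 7.1 is easily computed once the eigenvalues are available. The sketch below uses hypothetical eigenvalues, and the 90% threshold is an arbitrary choice for illustration; it is not claimed to reproduce the internals of the PRTools function pca.

```matlab
% Hypothetical eigenvalues of a sample covariance matrix, sorted non-ascending
gam  = [50 25 10 8 4 2 0.7 0.3];
frac = cumsum(gam) / sum(gam);      % fraction retained by the first D components
D    = find(frac >= 0.9, 1);        % smallest D retaining at least 90% (assumed threshold)
```

Plotting frac against the number of components gives a curve of the kind described for Figure 7.2.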


7.1.2 Multi-dimensional scaling 


Principal component analysis is limited to finding linear combinations of 
features to map the original data to. This is sufficient for many applica- 
tions, such as discarding low-variance directions in which only noise is 
suspected to be present. However, if the goal of a mapping is to inspect data 
in a two or three dimensional projection, PCA might discard too much 
information. For example, it can project two distinct groups in the data on
top of each other; or it might not show whether data is distributed non- 
linearly in the original high dimensional measurement space. 

To retain such information, a nonlinear mapping method is needed. 
As there are many possible criteria to fit a mapping, there are many 
different methods available. Here we will discuss just one: multi- 
dimensional scaling (MDS) (Kruskal and Wish, 1977) or Sammon map- 
ping (Sammon Jr, 1969). The self-organizing map, discussed in Section 
7.2.5, can be considered as another nonlinear projection method. 

MDS is based on the idea that the projection should preserve the
distances between objects as well as possible. Given a data set containing
N dimensional measurement vectors (or feature vectors) $z_i$,
$i = 1, \ldots, N_S$, we try to find new D dimensional vectors $y_i$,
$i = 1, \ldots, N_S$ according to this criterion. Usually, $D \ll N$; if the
goal is to visualize the data, we choose D = 2 or D = 3. If $\delta_{ij}$
denotes the known distance between objects $z_i$ and $z_j$, and $d_{ij}$
denotes the distance between projected objects $y_i$ and $y_j$, then
distances can be preserved as well as possible by placing the $y_i$ such
that the stress measure


$$E_S = \frac{1}{\sum_{i=1}^{N_S}\sum_{j=i+1}^{N_S} \delta_{ij}^2} \sum_{i=1}^{N_S}\sum_{j=i+1}^{N_S} (\delta_{ij} - d_{ij})^2 \qquad (7.5)$$


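is minimized. To do so, we take the derivative of (7.5) with respect to
the objects y. It is not hard to derive that, when Euclidean distances are
used, then


$$\frac{\partial E_S}{\partial y_i} = -\frac{2}{\sum_{j=1}^{N_S}\sum_{k=j+1}^{N_S}\delta_{jk}^2} \sum_{j=1,\, j\neq i}^{N_S} \frac{\delta_{ij} - d_{ij}}{d_{ij}}\,(y_i - y_j) \qquad (7.6)$$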


There is no closed-form solution setting this derivative to zero, but a 
gradient descent algorithm can be applied to minimize (7.5). The MDS 
algorithm then becomes: 


Algorithm 7.1: Multi-dimensional scaling
1. Initialization: Randomly choose an initial configuration of the
   projected objects $y^{(0)}$. Alternatively, initialize $y^{(0)}$ by
   projecting the original data set on its first D principal components.
   Set t = 0.
2. Gradient descent
   • For each object $y_i^{(t)}$, calculate the gradient according to (7.6).
   • Update: $y_i^{(t+1)} = y_i^{(t)} - \alpha\, \partial E_S / \partial y_i^{(t)}$,
     where $\alpha$ is a learning rate.
   • As long as $E_S$ significantly decreases, set t = t + 1 and go to step 2.
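Algorithm 7.1 can be sketched in a few lines of plain MATLAB. The example is hypothetical: four points on the corners of a unit square are mapped to D = 2, so that a zero-stress solution exists. As a simplification, a fixed number of iterations replaces the test on a significant decrease of the stress, and the learning rate and the initial configuration are arbitrary choices.

```matlab
% Hypothetical example: corners of a unit square, to be mapped to D = 2
Zpts  = [0 0; 1 0; 1 1; 0 1];
NS    = size(Zpts, 1);
Delta = zeros(NS);
for i = 1:NS
    for j = 1:NS
        Delta(i, j) = norm(Zpts(i, :) - Zpts(j, :));   % known distances delta_ij
    end
end
c = sum(sum(triu(Delta, 1).^2));       % sum over i < j of delta_ij^2

Y     = [0.1 0.0; 0.9 0.2; 1.2 0.8; -0.1 1.1];   % fixed, perturbed initial configuration
alpha = 0.5;                           % assumed learning rate
Es    = zeros(1, 201);
for t = 0:200
    Dm = zeros(NS);                    % distances d_ij in the current mapping
    for i = 1:NS
        for j = 1:NS
            Dm(i, j) = norm(Y(i, :) - Y(j, :));
        end
    end
    Es(t + 1) = sum(sum(triu(Delta - Dm, 1).^2)) / c;  % stress, cf. (7.5)
    G = zeros(size(Y));
    for i = 1:NS
        for j = [1:i-1, i+1:NS]
            G(i, :) = G(i, :) - (2 / c) * (Delta(i, j) - Dm(i, j)) / Dm(i, j) ...
                                * (Y(i, :) - Y(j, :)); % gradient, cf. (7.6)
        end
    end
    Y = Y - alpha * G;                 % gradient descent update
end
```

The final configuration Y reproduces the square only up to a shift, rotation or mirroring, since the stress depends on distances alone.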


Figures 7.3(a) to (c) show examples of two-dimensional MDS mapping. 
The data set, given in Table 7.1, consists of the geodesic distances of 13 
world cities. These distances can only be fully brought in accordance 
with the true three-dimensional geographical positions of the cities if the 
spherical surface of the earth is accounted for. Nevertheless, MDS 
has found two-dimensional mappings that resemble the usual Mercator 
projection of the earth surface on the tangent plane at the North Pole. 
Since distances are invariant to translation, rotation and mirroring, 
MDS can result in arbitrarily shifted, rotated and mirrored mappings.

[Figure 7.3: four scatter plots of the mapped cities, titled q = −2, q = 0,
q = 2 and D = 3; q = 0. The axes of the 2D plots run from −8000 to 8000.]

Figure 7.3 MDS applied to a matrix of geodesic distances of world cities


Table 7.1 Distance matrix of 13 cities (in km); the columns follow the same city order as the rows

Bangkok           0   3290   7280  10100  10600   9540  13300   7350   7060  13900  16100  17700   4640
Beijing                  0   7460  12900   8150   8090  10100   9190   5790  11000  17300  19000   2130
Cairo                           0   7390  14000   3380  12100  14000   2810   8960   9950  12800   9500
Capetown                               0  18500   9670  16100  10300  10100  12600   6080   7970  14800
Honolulu                                      0  11600   4120   8880  11200   8260  13300  11000   6220
London                                               0   8760  16900   2450   5550   9290  11700   9560
Los Angeles                                                 0  12700   9740   3930  10100   9010   8830
Melbourne                                                          0  14400  16600  13200  11200   8200
Moscow                                                                    0   7510  11500  14100   7470
New York                                                                         0   7770   8260  10800
Rio                                                                                     0   2730  18600
Santiago                                                                                       0  17200
Tokyo                                                                                                 0


The mapped objects in Figure 7.3 have been rotated such that New
York and Beijing lie on a horizontal line with New York on the left. 
Furthermore, the vertical axis is mirrored such that London is situated 
below the line New York-Beijing. 

The reason for the preferred projection on the tangent plane near the 
North Pole is that most cities in the list are situated in the Northern 
hemisphere. The exceptions are Santiago, Rio, Capetown and Melbourne. 
The distances for just these cities are least preserved. For instance, the true 
geodesic distance between Capetown and Melbourne is about 10 000 km,
but they are mapped opposite to each other with a distance of about
18 000 km.

For specific applications, it might be fruitful to focus more on local 
structure or on global structure. A way of doing so is by using the more 
general stress measure, where an additional free parameter q is introduced:


$$E_S = \frac{1}{\sum_{i=1}^{N_S}\sum_{j=i+1}^{N_S}\delta_{ij}^{(q+2)}} \sum_{i=1}^{N_S}\sum_{j=i+1}^{N_S} \delta_{ij}^{q}\,(\delta_{ij} - d_{ij})^2 \qquad (7.7)$$


For q = 0, (7.7) is equal to (7.5). However, as q is decreased, the
$(\delta_{ij} - d_{ij})$ term remains constant but the $\delta_{ij}^q$ term
will weight small distances more heavily than large ones. In other words,
local distance preservation is emphasized, whereas global distance
preservation is less important. This is demonstrated in Figure 7.3(a) to
(c), where q = −2, q = 0 and q = +2 have been used, respectively.
Conversely, if q is increased, the resulting stress measure will emphasize
preserving global distances over local ones. When the stress measure is
used with q = −1, so that each squared error is divided by the
corresponding $\delta_{ij}$, the resulting mapping is also called a Sammon
mapping.
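The influence of q can be made concrete with a small numerical sketch of (7.7). The three distance pairs are hypothetical, and the helper names stress and share are introduced only for this illustration.

```matlab
% Hypothetical true distances delta and mapped distances d for three object pairs
delta = [1 2 10];        % one short, one medium and one long true distance
d     = [1.5 2 9];       % the short distance is off by 0.5, the long one by 1

% Generalized stress, cf. (7.7)
stress = @(q) sum(delta.^q .* (delta - d).^2) / sum(delta.^(q + 2));
% Relative contribution of each pair to the numerator of the stress
share  = @(q) delta.^q .* (delta - d).^2 / sum(delta.^q .* (delta - d).^2);

E_raw   = stress(0);     % reduces to (7.5)
s_local = share(-2);     % q = -2: dominated by the error in the short distance
s_glob  = share(2);      % q = +2: dominated by the error in the long distance
```

With q = −2 almost all of the stress comes from the short pair, whereas with q = +2 it comes from the long pair, which is the local versus global emphasis described above.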

The effects of using different values of q are demonstrated in Table 7.2. 
Here, for each city in the list, the nearest and the furthest cities are given. 
For each pair, the true geodesic distance from Table 7.1 is shown 
together with the corresponding distances in the MDS mappings. 
Clearly, for short distances, q = −2 is more accurate. For long distances,
q = +2 is favourable. In Figure 7.3(a) to (c) the nearest cities (according
to the geodesic distances) are indicated by solid thin lines. In addition,
the nearest cities according to the MDS maps are indicated by dotted
thick lines. Clearly, if q = −2, the nearest neighbours are best preserved.
This is expected because the relations between nearest neighbours are 
best preserved if the local structure is best preserved. 

Figure 7.3 also shows a three-dimensional MDS mapping of the world 
cities. For clarity, a sphere is also included. MDS is able to find a 2D to 
3D mapping such that the resulting map resembles the true geographic 
position on the earth surface. The positions are not exactly on the sur- 
face of a sphere because the input of the mapping is a set of geodesic 
distances, while the output map measures distances according to a three- 
dimensional Euclidean metric. 

The MDS algorithm has the same problems any gradient-based 
method has: it is slow and it cannot be guaranteed to converge to a 
global optimum. Another problem the MDS algorithm in particular 
suffers from is the fact that the number of distances in a data set grows 
quadratically with Ns. If the data sets are very large, then we need to 
select a subset of the points to map in order to make the application 
tractable. Finally, a major problem specific to MDS is that there is no
clear way of finding the projection $y_{new}$ of a new, previously unseen
point $z_{new}$. The only way of making sure that the new configuration
$\{y_i, i = 1, \ldots, N_S; y_{new}\}$ minimizes (7.5) is to recalculate the
entire MDS mapping. In practice, this is infeasible. A crude approximation
is to use triangulation: to map down to D dimensions, search for the D + 1
nearest neighbours of $z_{new}$ among the training points $z_i$. A mapping
$y_{new}$ for $z_{new}$ can then be found by preserving the distances to the
projections of these neighbours exactly. This method is fast, but inaccur- 
ate: because it uses nearest neighbours, it will not give a smooth inter- 
polation of originally mapped points. A different, but more complicated 
solution is to use an MDS mapping to train a neural network to predict 
the y; for given z;. This will give smoother mappings, but brings with it 
all the problems inherent in neural network training. An example of how 
to use MDS in PRTools is given in Listing 7.2. 
Listing 7.2
PRTools code for performing an MDS mapping.


load worldcities;                      % Load dataset D
options.q=2;
w=mds(D,2,options);                    % Map to 2D with q=2
figure; clf; scatterd(D*w,'both');     % Plot projections


7.2 CLUSTERING 


Instead of reducing the number of features, we now focus on reducing 
the number of objects in the data set. The aim is to detect ‘natural’ 
clusters in the data, i.e. clusters which agree with our human interpret- 
ation of the data. Unfortunately, it is very hard to define what a natural 
cluster is. In most cases, a cluster is defined as a subset of objects for 
which the resemblance between the objects within the subset is larger 
than the resemblance with other objects in other subsets (clusters). 

This immediately introduces the next problem: how is the resemblance 
between objects defined? The most important cue for the resemblance of 
two objects is the distance between the objects, i.e. their dissimilarity. In 
most cases the Euclidean distance between objects is used as a
dissimilarity measure, but there are many other possibilities. The $L_p$
norm is well known (see Appendix A.1.1 and A.2):


$$d_p(z_i, z_j) = \left(\sum_{n=1}^{N} |z_{i,n} - z_{j,n}|^p\right)^{1/p} \qquad (7.8)$$


The cosine distance uses the angle between two vectors as a dissimilarity 
measure. It is often used in the automatic clustering of text documents:


$$d(z_i, z_j) = 1 - \frac{z_i^T z_j}{\|z_i\|_2\, \|z_j\|_2} \qquad (7.9)$$
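Both distance measures are simple to evaluate. The following sketch uses two hypothetical vectors; the anonymous functions dp and dcos are names chosen here and are not part of PRTools.

```matlab
zi = [3 0 4];
zj = [0 4 3];                     % hypothetical measurement vectors

dp   = @(a, b, p) sum(abs(a - b).^p)^(1 / p);         % L_p distance, cf. (7.8)
dcos = @(a, b) 1 - (a * b') / (norm(a) * norm(b));    % cosine distance, cf. (7.9)

d1 = dp(zi, zj, 1);               % city-block distance: 3 + 4 + 1 = 8
d2 = dp(zi, zj, 2);               % Euclidean distance: sqrt(9 + 16 + 1)
dc = dcos(zi, zj);                % 1 - 12/25 = 0.52
```

Note that the cosine distance ignores the lengths of the vectors: scaling zi by a positive factor leaves dc unchanged, while d1 and d2 change.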


In PRTools the basic distance computation is implemented in the
function proxm. Several methods for computing distances (and
similarities) are defined. Besides the two basic distances mentioned above,
some similarity measures are also defined, like the inner product between
vectors, and the Gaussian kernel. The function is implemented as a
mapping. When a mapping w is trained on some data z, the application
y*w then computes the distance between any vector in z and any vector
in y. For the squared Euclidean distances a short-cut function distm is
defined.


Listing 7.3 
PRTools code for defining and applying a proximity mapping. 


z=gendatb(3);              % Create some train data
y=gendats(5);              % and some test data
w=proxm(z,'d',2);          % Squared Euclidean distance to z
D=y*w;                     % 5 x 3 distance matrix
D=distm(y,z);              % The same 5 x 3 distance matrix
w=proxm(z,'o');            % Cosine distance to z
D=y*w;                     % New 5 x 3 distance matrix
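For readers without PRTools, a matrix of squared Euclidean distances between two sets of objects can be computed with a single matrix expression. The sketch below uses hypothetical data and plain MATLAB only; it illustrates the quantity that the text attributes to distm, not that function's implementation.

```matlab
% Hypothetical data: 3 'training' objects z and 5 'test' objects y (rows), N = 2
z = [0 0; 1 0; 0 2];
y = [0 0; 1 1; 2 0; 0 3; 3 4];

% ||y_i - z_j||^2 = ||y_i||^2 + ||z_j||^2 - 2*y_i'*z_j, evaluated for all pairs
D = sum(y.^2, 2) + sum(z.^2, 2)' - 2 * (y * z');     % 5 x 3 matrix
```

Element D(i,j) holds the squared distance between the i-th object in y and the j-th object in z, matching the 5 x 3 layout of Listing 7.3.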


The distance between objects should reflect the important structures in 
the data set. It is assumed in all clustering algorithms that distances 
between objects are informative. This means that when objects are close 
in the feature space, they should also resemble each other in the real 
world. When the distances are not defined sensibly, and remote objects 
in the feature space correspond to similar real-world objects, no clus- 
tering algorithm will be able to give acceptable results without extra 
information from the user. In these sections it will therefore be assumed 
that the features are scaled such that the distances between objects are 
informative. Note that a cluster does not necessarily correspond directly 
to a class. A class can consist of multiple clusters, or multiple classes may 
form a single cluster (and will therefore probably be hard to discriminate 
between). 

Because clustering is unsupervised, it is very hard to evaluate a
clustering result. Different clustering methods will yield a different set of
clusters, and the user has to decide which clustering is to be preferred.
A quantitative measure of the quality of the clustering is the average
distance of the objects to their respective cluster centre. Assume that the
objects $z_i$ ($i = 1, \ldots, N_S$) are clustered in K clusters, $C_k$
($k = 1, \ldots, K$) with cluster centre $\mu_k$ and to each of the clusters
$N_k$ objects are assigned:


$$J = \frac{1}{N_S}\sum_{k=1}^{K}\frac{1}{N_k}\sum_{z \in C_k}\|z - \mu_k\|^2 \qquad (7.10)$$


Other criteria, such as the ones defined in Chapter 5 for supervised 
learning, can also be applied. 
With these error criteria, different clustering results can be compared
provided that K is kept constant. Clustering results found for varying K
cannot be compared. The pitfall of using criteria like (7.10) is that the
optimal number of clusters is $K = N_S$, i.e. the case in which each object
belongs to its own cluster. In the average-distance criterion it will result
in the trivial solution: J = 0.
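The pitfall can be demonstrated with a small sketch of an average-distance criterion of the type (7.10). The one-dimensional data set is hypothetical, and the normalization by $N_S$ and by the cluster size $N_k$ is an assumption of this example.

```matlab
% Hypothetical one-dimensional data set with two obvious groups
z  = [1 2 3 11 12 13]';
NS = numel(z);

% Criterion for an assignment vector c (c(i) is the cluster index of object i);
% the cluster centre is taken to be the mean of the objects in the cluster
J = @(c) sum(arrayfun(@(k) sum((z(c == k) - mean(z(c == k))).^2) / nnz(c == k), ...
                      1:max(c))) / NS;

J1 = J([1 1 1 1 1 1]');    % K = 1: everything in one cluster
J2 = J([1 1 1 2 2 2]');    % K = 2: the two natural groups
JN = J((1:NS)');           % K = NS: every object is its own cluster
```

The criterion keeps decreasing as K grows and reaches zero for K = NS, so it cannot by itself indicate that K = 2 is the natural choice here.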

The choice of the number of clusters K is a very fundamental problem. 
In some applications, an expected number of clusters is known beforehand. 
Using this number for the clustering does not always yield optimal results 
for all types of clustering methods: it might happen that, due to noise, a 
local minimum is reached. Using a slightly larger number of clusters is 
therefore sometimes preferred. In most cases the number of clusters to look 
for is one of the research questions in the investigation of a data set. Often, 
the data is repeatedly clustered using a range of values of K, and the 
clustering criterion values are compared. When a significant decrease in a 
criterion value appears, a ‘natural’ clustering is probably found. Unfortu- 
nately, in practice, it is very hard to objectively say when a significant drop 
occurs. True automatic optimization of the number of clusters is possible 
in just a very few situations, when the cluster shapes are given. 

In the coming sections we will discuss four methods for performing 
a clustering: hierarchical clustering, K-means clustering, mixtures of 
Gaussians and finally self-organizing maps. 


7.2.1 Hierarchical clustering 


The basic idea of hierarchical clustering (Johnson, 1967) is to collect 
objects into clusters by combining the closest objects and clusters to 
larger clusters until all objects are in one cluster. An important advan- 
tage is that the objects are not just placed in K distinct groups, but are 
placed in a hierarchy of clusters. This gives more information about the 
structure in the data set, and shows which clusters are similar or dissimi- 
lar. This makes it possible to detect subclusters in large clusters, or to 
detect outlier clusters. 

Figure 7.4 provides an example. The distance matrix of the 13 world 
cities given in Table 7.1 is used to find a hierarchical cluster structure. 
The figure shows that at the highest level the cities in the southern 
hemisphere are separated from the ones in the northern part. At a 
distance level of about 5000km we see the following clusters: ‘East 
Asiatic cities’, ‘European/Arabic cities’, ‘North American cities’, ‘African 
city’, ‘South American cities’, ‘Australian city’. 


Figure 7.4 Hierarchical clustering of the data in Table 7.1 (dendrogram; vertical axis: distance in km, from 0 to 14000) 


Given a set of $N_S$ objects $z_i$ to be clustered, and an $N_S \times N_S$ distance 
matrix between these objects, hierarchical clustering involves the 
following steps: 


Algorithm 7.2: Hierarchical clustering 


1. Assign each object to its own cluster, resulting in Ns clusters, each 
containing just one object. The initial distances between all clusters 
are therefore just the distances between all objects. 

2. Find the closest pair of clusters and merge them into a single cluster, 
so that the number of clusters reduces by one. 

3. Compute the distance between the new cluster and each of the old 
clusters, where the distance between two clusters can be defined in a 
number of ways (see below). 

4. Repeat steps 2 and 3 until all items are clustered into a single cluster 
of size Ns, or until a predefined number of clusters K is achieved. 


In step 3 the distances between clusters can be computed in 
several different ways. We can distinguish single-link clustering, average-link 
clustering and complete-link clustering. In single-link clustering, 
the distance between two clusters $C_i$ and $C_j$ is defined as the 
shortest distance from any object in one cluster to any object in the 
other cluster: 

$$d_s(C_i, C_j) = \min_{x \in C_i,\; y \in C_j} \| x - y \|^2 \qquad (7.11)$$

For average-link clustering, the minimum operator is replaced by the 
average distance, and for complete-link clustering it is replaced by 
the maximum operator. 

In Figure 7.5 the difference between single link and complete link is 
shown for a very small toy data set (Ns = 6). At the start of the clus- 
tering, both single-link (left) and complete-link clustering (right) combine 
the same objects to clusters. When larger clusters appear, in the lower 
row, different objects are combined. The different definitions for the 
inter-cluster distances result in different characteristic cluster shapes. For 
single-link clustering, the clusters tend to become long and spidery, while 
for complete-link clustering the clusters become very compact. 
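
The effect of the linkage rule can be reproduced with a small sketch of the merging procedure. It is plain Python, not PRTools code; the function name `hierarchical_clustering` and the one-dimensional toy data are assumptions made for this illustration. The sketch recomputes inter-cluster distances from the objects at every pass instead of updating a distance matrix, which is simple but slow for large data sets.

```python
import math


def hierarchical_clustering(points, linkage="single", num_clusters=1):
    """Agglomerative clustering along the lines of steps 1-4 above.

    Returns the final clusters (lists of object indices) and the merge
    history [(cluster_a, cluster_b, distance), ...] that a dendrogram plots.
    """
    combine = {"single": min, "complete": max}[linkage]

    def cluster_distance(a, b):
        # Euclidean distance is used here; squaring it would not change
        # which pair of clusters is closest.
        return combine(math.dist(points[i], points[j]) for i in a for j in b)

    clusters = [[i] for i in range(len(points))]        # step 1
    merges = []
    while len(clusters) > num_clusters:                 # step 4
        # Step 2: find the closest pair of clusters.
        distance, a, b = min(
            (cluster_distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merges.append((clusters[a], clusters[b], distance))
        # Step 3 is implicit: distances to the merged cluster are
        # recomputed from the objects in the next pass.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges


# One-dimensional toy data with slowly growing gaps (hypothetical values).
points = [(0.0,), (1.0,), (2.2,), (3.6,), (5.2,)]
single, _ = hierarchical_clustering(points, "single", num_clusters=2)
complete, _ = hierarchical_clustering(points, "complete", num_clusters=2)
print(single)     # [[0, 1, 2, 3], [4]]
print(complete)   # [[0, 1], [2, 3, 4]]
```

Single-link clustering chains the first four objects together because each one is close to its predecessor, whereas complete-link clustering prefers the two more compact groups.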

The user now has to decide what the most suitable number of 
clusters is. This can be based on a dendrogram, which shows 
at which distances the objects or clusters are grouped together. Examples 
for single- and complete-link clustering are shown in Figure 7.6. At 
smaller distances, pairs of single objects are combined; at larger distances, 
complete clusters. A large gap in the merge distances, as 
can be seen in the single-link dendrogram, indicates that the two 
clusters are far apart. Cutting the dendrogram at height 1.0 will then 
result in a ‘natural’ clustering, consisting of two clusters. In many 
practical cases, the cut is not obvious to define, and the user has to guess 
an appropriate number of clusters. 

Figure 7.5 The development from K = Ns clusters to K = 1 cluster. (a) Single-link 
clustering. (b) Complete-link clustering 

Figure 7.6 Hierarchical clustering with two different clustering types. (a) Single-link 
clustering. (b) Complete-link clustering 

Note that this clustering is obtained using a fixed data set. When new 
objects become available, there is no straightforward way to include them in 
an existing clustering. In these cases, the clustering has to be 
reconstructed from scratch using the complete data set. 

In PRTools, it is simple to construct a hierarchical clustering; Listing 
7.4 shows an example. Note that the clustering operates on a distance 
matrix rather than the data set. A distance matrix can be obtained with 
the function distm. 


Listing 7.4 
PRTools code for obtaining a hierarchical clustering. 


z = gendats(5);                  % Generate some data
figure; clf; scatterd(z);        % and plot it
dendr = hclust(distm(z),'s');    % Single-link clustering
figure; clf; plotdg(dendr);      % Plot the dendrogram


7.2.2 K-means clustering 


K-means clustering (Bishop, 1995) differs from hierarchical clustering in two 
important respects. First, K-means clustering requires the number of 
clusters K beforehand. Second, it is not hierarchical; instead, it partitions 
the data set into K disjoint subsets. Again the clustering is basically 
determined by the distances between objects. The K-means algorithm 
has the following structure: 


Algorithm 7.3: K-means clustering 


1. Assign each object randomly to one of the clusters $k = 1, \ldots, K$. 
2. Compute the means of each of the clusters: 

$$\mu_k = \frac{1}{N_k} \sum_{z_i \in C_k} z_i \qquad (7.12)$$

3. Reassign each object $z_i$ to the cluster with the closest mean $\mu_k$. 
4. Return to step 2 until the means of the clusters do not change any more. 


The initialization step can be adapted to speed up the convergence. 
Instead of randomly labelling the data, K randomly chosen objects are 
taken as cluster means. The procedure then enters the loop at step 3. 
Note again that the procedure depends on distances, in this case between 
the objects $z_i$ and the means $\mu_k$. Scaling the feature space will here also 
change the final clustering result. An advantage of K-means clustering is 
that it is very easy to implement. On the other hand, it is unstable: 
running the procedure several times will give several different results. 
Depending on the random initialization, the algorithm will converge to 
different (local) minima. In particular when a high number of clusters is 
requested, it often happens that some clusters do not gain sufficient 
support and are ignored. The effective number of clusters then becomes 
much less than K. 
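
Both the simplicity and the dependence on the initialization show up in a short sketch. It is plain Python, not the PRTools `kmeans` routine; the function names, the seeds and the toy data are assumptions made for this illustration. The sketch uses the variant in which K randomly chosen objects serve as initial means, and it keeps the old mean when a cluster loses all its objects.

```python
import math
import random


def kmeans(objects, k, seed=0, max_iterations=100):
    """K-means with K randomly chosen objects as initial cluster means."""
    rng = random.Random(seed)
    means = [list(z) for z in rng.sample(objects, k)]
    labels = None
    for _ in range(max_iterations):
        # Step 3: assign every object to the cluster with the closest mean.
        new_labels = [
            min(range(k), key=lambda c: math.dist(z, means[c])) for z in objects
        ]
        if new_labels == labels:        # step 4: nothing changed, so stop
            break
        labels = new_labels
        # Step 2: recompute the mean of every non-empty cluster.
        for c in range(k):
            members = [z for z, label in zip(objects, labels) if label == c]
            if members:
                means[c] = [sum(column) / len(members) for column in zip(*members)]
    return labels, means


def within_cluster_scatter(objects, labels, means):
    """Sum of squared distances of the objects to their assigned means."""
    return sum(math.dist(z, means[label]) ** 2 for z, label in zip(objects, labels))


# Two well-separated groups of four objects each (hypothetical values).
objects = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]

# Because single runs may end in a poor local minimum, run several
# initializations and keep the result with the smallest scatter.
runs = [kmeans(objects, 2, seed=s) for s in range(10)]
labels, means = min(runs, key=lambda run: within_cluster_scatter(objects, *run))
print(labels, sorted(means))
```

Even for this easy data set an unlucky initialization exists: if the two initial means are (0, 1) and (1, 0), the procedure stops at a partition that mixes both groups. Keeping the best of several runs is a common, if crude, remedy.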

In Figure 7.7 the result of a K-means clustering is shown for a simple 
2D data set. The means are indicated by the circles. At the start of the 
optimization, i.e. at the start of the trajectory, each mean coincides with 
a data object. After 10 iteration steps, the solution has converged. The result 
of the last iteration is indicated by ‘x’. In this case, the number of 
clusters in the data and the predefined K = 3 match. A fairly stable 
solution is found. 

Figure 7.7 The development of the cluster means during 10 update steps of the 
K-means algorithm 


Example 7.2 Classification of mechanical parts, K-means clustering 
Two results of the K-means algorithm applied to the unlabelled data 
set of Figure 5.1(b) are shown in Figure 7.8. The algorithm is called 
with K = 4. The differences between the two results are solely caused 
by the different realizations of the random initialization of the algorithm. 
The first result, Figure 7.8(a), is more or less correct (compare 
with the correct labelling as given in Figure 5.1(a)). Unfortunately, the 
result in Figure 7.8(b) indicates that this success is not reproducible. 

Figure 7.8 Two results of K-means clustering applied to the ‘mechanical parts’ data set 

The algorithm is implemented in PRTools with the function kmeans. 
See Listing 7.5. 


Listing 7.5 
PRTools code for fitting and plotting a K-means clustering, with K = 4. 


load nutsbolts_unlabeled;      % Load the data set z
lab = kmeans(z,4);             % Perform k-means clustering
y = dataset(z,lab);            % Label by cluster assignment
figure; clf; scatterd(y);      % and plot it


7.2.3 Mixture of Gaussians 


The K-means clustering algorithm implicitly assumes spherical clusters, 
because the Euclidean distance to the cluster centres $\mu_k$ is computed. 
All objects on a circle or hypersphere around the cluster centre have 
the same resemblance to that cluster. However, in many cases clusters 
have more structure than that. In the mixture of Gaussians model 
(Dempster et al., 1977; Bishop, 1995), it is assumed that the objects in 
each of the K clusters are distributed according to a Gaussian distribution. 
That means that each cluster is characterized not only by a mean $\mu_k$ 
but also by a covariance matrix $C_k$. In effect, a complete density estimate 
of the data is performed, where the density is modelled by: 

$$p(z) = \sum_{k=1}^{K} \pi_k \, N(z \mid \mu_k, C_k) \qquad (7.13)$$

$N(z \mid \mu_k, C_k)$ stands for the multivariate Gaussian distribution. The $\pi_k$ are the 
mixing parameters (for which $\sum_{k=1}^{K} \pi_k = 1$ and $\pi_k \ge 0$). The mixing 
parameter $\pi_k$ can be regarded as the probability that z is produced by 
a random number generator with probability density $N(z \mid \mu_k, C_k)$. 
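
The generative reading of the mixing parameters can be tried out directly. The sketch below is plain Python and restricted to one dimension for brevity, so that each covariance matrix reduces to a variance; the function names and the parameter values are assumptions made for this illustration.

```python
import math
import random


def gaussian_pdf(z, mean, variance):
    """Univariate Gaussian density N(z | mean, variance)."""
    return math.exp(-0.5 * (z - mean) ** 2 / variance) / math.sqrt(2.0 * math.pi * variance)


def mixture_pdf(z, weights, means, variances):
    """Mixture density: weighted sum of the component densities."""
    return sum(w * gaussian_pdf(z, m, v) for w, m, v in zip(weights, means, variances))


def sample_mixture(count, weights, means, variances, seed=0):
    """Draw (z, k) pairs: pick component k with its mixing probability, then draw z."""
    rng = random.Random(seed)
    samples = []
    for _ in range(count):
        k = rng.choices(range(len(weights)), weights=weights)[0]
        samples.append((rng.gauss(means[k], math.sqrt(variances[k])), k))
    return samples


weights, means, variances = [0.3, 0.7], [0.0, 5.0], [1.0, 0.25]   # hypothetical values
samples = sample_mixture(5000, weights, means, variances)
fraction_first = sum(1 for _, k in samples if k == 0) / len(samples)
# Riemann sum of the mixture density over [-10, 15]; it should be close to 1.
area = sum(mixture_pdf(-10.0 + 0.01 * i, weights, means, variances) for i in range(2500)) * 0.01
print(round(fraction_first, 2), round(area, 3))
```

The fraction of samples produced by the first component approaches its mixing parameter, and the weighted sum of densities integrates to one because the mixing parameters do.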

The parameters to be estimated are: the number K of mixing components, 
the mixing parameters $\pi_1, \ldots, \pi_K$, the mean vectors $\mu_k$ and the 
covariance matrices $C_k$. We will denote this set of free parameters by 
$\Psi = \{\pi_k, \mu_k, C_k \mid k = 1, \ldots, K\}$. This increase in the number of free 
parameters, compared to the K-means algorithm, leads to the need for 
a higher number of training objects in order to reliably estimate these 
cluster parameters. On the other hand, because more flexible cluster 
shapes are applied, fewer clusters might have to be used to approximate 
the structure in the data. 

One method for fitting the parameters of (7.13) to the data is to use 
maximum likelihood estimation (see Section 3.1.4), i.e. to optimize the 
log-likelihood: 

$$L(Z \mid \Psi) = \ln p(z_1, \ldots, z_{N_S} \mid \Psi) = \ln \prod_{n=1}^{N_S} p(z_n \mid \Psi) \qquad (7.14)$$

with $Z = \{z_1, \ldots, z_{N_S}\}$. The optimization of $L(Z \mid \Psi)$ with respect to $\Psi$ is complicated 
and cannot be done directly with this equation. Fortunately, there 
exists a standard ‘hill-climbing’ algorithm to optimize the parameters in 
this equation. The expectation-maximization (EM) algorithm is a general 
procedure which assumes the following problem definition. We 
have two data spaces: the observed data Z and the so-called missing 
data (hidden data) $X = \{x_1, \ldots, x_{N_S}\}$. Let Y be the complete data in 
which each vector $y_n$ consists of a known part $z_n$ and a missing part $x_n$. 
Thus, the vectors $y_n$ are defined by $y_n^T = [z_n^T \; x_n^T]$. Only $z_n$ is available 
in the training set; $x_n$ is missing, and as such unknown. The EM algorithm 
uses this model of ‘incomplete data’ to iteratively estimate the 
parameters $\Psi$ of the distribution of $z_n$. It tries to maximize the log-likelihood 
$L(Z \mid \Psi)$. In fact, if the result of the i-th iteration is denoted 
by $\Psi^{(i)}$, then the EM algorithm assures that $L(Z \mid \Psi^{(i+1)}) \ge L(Z \mid \Psi^{(i)})$. 

The mathematics behind the EM algorithm is rather technical and will 
be omitted. However, the intuitive idea behind the EM algorithm is as 
follows. The basic assumption is that with all data available, that is if we 
had observed the complete data Y, the maximization of $L(Y \mid \Psi)$ would 
be simple, and the problem could be solved easily. Unfortunately, only Z 
is available and X is missing. Therefore, we maximize the expectation 
$E[L(Y \mid \Psi) \mid Z]$ instead of $L(Y \mid \Psi)$. The expectation is taken over the complete 
data Y, but under the condition of the observed data Z. Given the 
estimate $\Psi^{(i)}$ from the i-th iteration, the next estimate follows from: 

$$E[L(Y \mid \Psi) \mid Z] = \int \bigl( \ln p(Y \mid \Psi) \bigr) \, p(Y \mid Z, \Psi^{(i)}) \, dY$$
$$\Psi^{(i+1)} = \arg\max_{\Psi} \{ E[L(Y \mid \Psi) \mid Z] \} \qquad (7.15)$$

The integral extends over the entire Y space. The first step is the E step; 
the last one is the M step. 

In the application of EM optimization to a mixture of Gaussians (with 
predefined K; see also the discussion on page 228) the missing part $x_n$, 
associated with $z_n$, is a K-dimensional vector. $x_n$ indicates which one of 
the K Gaussians generated the corresponding object $z_n$. The $x_n$ are called 
indicator variables. They use position coding to indicate the Gaussian 
associated with $z_n$. In other words, if $z_n$ is generated by the k-th Gaussian, 
then $x_{n,k} = 1$ and all other elements in $x_n$ are zero. With that, the 
prior probability that $x_{n,k} = 1$ is $\pi_k$. 

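In this case, the complete log-likelihood of $\Psi$ can be written as: 

$$L(Y \mid \Psi) = \ln \prod_{n=1}^{N_S} p(z_n, x_n \mid \Psi) = \ln \prod_{n=1}^{N_S} p(z_n \mid x_n, \Psi) \, p(x_n \mid \Psi)$$
$$= \ln \prod_{n=1}^{N_S} \prod_{k=1}^{K} \bigl[ N(z_n \mid \mu_k, C_k) \, \pi_k \bigr]^{x_{n,k}} \qquad (7.16)$$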

Under the condition of a given Z, the probability density $p(Y \mid Z, \Psi) = p(X, Z \mid Z, \Psi)$ 
can be replaced with the marginal probability $P(X \mid Z, \Psi)$. 
Therefore in (7.15), we have: 

$$E[L(Y \mid \Psi) \mid Z] = \int \bigl( \ln p(Y \mid \Psi) \bigr) \, p(Y \mid Z, \Psi^{(i)}) \, dY = \sum_{X} L(Y \mid \Psi) \, P(X \mid Z, \Psi^{(i)}) \qquad (7.17)$$

But since in (7.16) $L(Y \mid \Psi)$ is linear in X, we conclude that 

$$E[L(Y \mid \Psi) \mid Z] = L(\bar{X}, Z \mid \Psi) \qquad (7.18)$$

where $\bar{X}$ is the expectation of the missing data under the condition of Z 
and $\Psi^{(i)}$: 


$$\bar{x}_{n,k} = E[x_{n,k} \mid z_n, \Psi^{(i)}] = P(x_{n,k} = 1 \mid z_n, \Psi^{(i)})$$
$$= \frac{p(z_n \mid x_{n,k} = 1, \Psi^{(i)}) \, P(x_{n,k} = 1 \mid \Psi^{(i)})}{p(z_n \mid \Psi^{(i)})} = \frac{N(z_n \mid \mu_k^{(i)}, C_k^{(i)}) \, \pi_k^{(i)}}{\sum_{j=1}^{K} N(z_n \mid \mu_j^{(i)}, C_j^{(i)}) \, \pi_j^{(i)}} \qquad (7.19)$$

The variable $\bar{x}_{n,k}$ is called the ownership because it indicates to what 
degree sample $z_n$ is attributed to the k-th component. 

For the M step, we have to optimize the expectation of the log-likelihood, 
that is $E[L(Y \mid \Psi) \mid Z] = L(\bar{X}, Z \mid \Psi)$. We do this by substituting 
$\bar{x}_{n,k}$ for $x_{n,k}$ in equation (7.16), taking the derivative with respect 
to $\Psi$ and setting the result to zero. Solving the equations will yield 
expressions for the parameters $\Psi = \{\pi_k, \mu_k, C_k\}$ in terms of the data $z_n$ 
and $\bar{x}_{n,k}$. 

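Taking the derivative of $L(\bar{X}, Z \mid \Psi)$ with respect to $\hat{\mu}_k$ gives: 

$$\sum_{n=1}^{N_S} \bar{x}_{n,k} (z_n - \hat{\mu}_k)^T C_k^{-1} = 0 \qquad (7.20)$$

Rewriting this gives the update rule for $\hat{\mu}_k$: 

$$\hat{\mu}_k = \frac{\sum_{n=1}^{N_S} \bar{x}_{n,k} \, z_n}{\sum_{n=1}^{N_S} \bar{x}_{n,k}} \qquad (7.21)$$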

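The estimation of $C_k$ is somewhat more complicated. With the help of 
(b.39), we can derive: 

$$\frac{\partial}{\partial C_k^{-1}} \ln N(z_n \mid \hat{\mu}_k, C_k) = \frac{1}{2} C_k - \frac{1}{2} (z_n - \hat{\mu}_k)(z_n - \hat{\mu}_k)^T \qquad (7.22)$$

This results in the following update rule for $C_k$: 

$$\hat{C}_k = \frac{\sum_{n=1}^{N_S} \bar{x}_{n,k} (z_n - \hat{\mu}_k)(z_n - \hat{\mu}_k)^T}{\sum_{n=1}^{N_S} \bar{x}_{n,k}} \qquad (7.23)$$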

Finally, the parameters $\pi_k$ cannot be optimized directly because of the 
extra constraint, namely that $\sum_{k=1}^{K} \pi_k = 1$. This constraint can be 
enforced by introducing a Lagrange multiplier $\lambda$ and extending the log-likelihood 
(7.16) to: 

$$L(Y \mid \Psi) = \sum_{n=1}^{N_S} \sum_{k=1}^{K} \bigl[ x_{n,k} \ln N(z_n \mid \mu_k, C_k) + x_{n,k} \ln \pi_k \bigr] - \lambda \Bigl( \sum_{k=1}^{K} \pi_k - 1 \Bigr) \qquad (7.24)$$

Substituting $\bar{x}_{n,k}$ for $x_{n,k}$ and setting the derivative with respect to $\pi_k$ to 
zero yields: 

$$\sum_{n=1}^{N_S} \bar{x}_{n,k} = \lambda \pi_k \qquad (7.25)$$

Summing equation (7.25) over all clusters, we get: 

$$\sum_{k=1}^{K} \sum_{n=1}^{N_S} \bar{x}_{n,k} = \lambda \sum_{k=1}^{K} \pi_k = \lambda \qquad (7.26)$$

Further note that summing $\sum_{n=1}^{N_S} \bar{x}_{n,k}$ over all clusters gives the 
total number of objects $N_S$; thus it follows that $\lambda = N_S$. By substituting 
this result back into (7.25), the update rule for $\pi_k$ becomes: 

$$\hat{\pi}_k = \frac{1}{N_S} \sum_{n=1}^{N_S} \bar{x}_{n,k} \qquad (7.27)$$


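Note that with (7.27), the determination of $\hat{\mu}_k$ and $\hat{C}_k$ in (7.21) and 
(7.23) can be simplified to: 

$$\hat{\mu}_k = \frac{1}{N_S \hat{\pi}_k} \sum_{n=1}^{N_S} \bar{x}_{n,k} \, z_n$$
$$\hat{C}_k = \frac{1}{N_S \hat{\pi}_k} \sum_{n=1}^{N_S} \bar{x}_{n,k} (z_n - \hat{\mu}_k)(z_n - \hat{\mu}_k)^T \qquad (7.28)$$

The complete EM algorithm for fitting a mixture of Gaussians model to a 
data set is as follows. 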


Algorithm 7.4: EM algorithm for estimating a mixture of Gaussians 
Input: The number K of mixing components and the data $z_n$. 


1. Initialization: Select randomly (or heuristically) $\Psi^{(0)}$. Set i = 0. 

2. Expectation step (E step): Using the observed data set $z_n$ and the 
estimated parameters $\Psi^{(i)}$, calculate the expectations $\bar{x}_n$ of the missing 
values $x_n$ using (7.19). 

3. Maximization step (M step): Optimize the model parameters 
$\Psi^{(i+1)} = \{\hat{\pi}_k, \hat{\mu}_k, \hat{C}_k\}$ by maximum likelihood estimation using the 
expectations $\bar{x}_n$ calculated in the previous step. See (7.27) and (7.28). 

4. Stop if the likelihood did not change significantly in step 3. Else, 
increment i and go to step 2. 


Clustering by EM results in more flexible cluster shapes than K-means 
clustering. It is also guaranteed to converge to (at least) a local optimum. 
However, it still has some of the same problems as K-means: the choice 
of the appropriate number of clusters, the dependence on initial conditions 
and the danger of convergence to local optima. 
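
The E and M steps can be followed in a compact sketch. It is plain Python, not the PRTools `emclust` routine, and it makes several simplifying assumptions that should be read as such: the data are one-dimensional, so each covariance matrix is a single variance; the initialization is a simple heuristic based on the sorted data; and a fixed number of iterations replaces the stopping test on the likelihood. All names and data values are made up for this illustration.

```python
import math
import random


def gaussian_pdf(z, mean, variance):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (z - mean) ** 2 / variance) / math.sqrt(2.0 * math.pi * variance)


def em_mixture_1d(data, k, iterations=50):
    """EM for a one-dimensional mixture of K Gaussians (fixed iteration count)."""
    n_s = len(data)
    ordered = sorted(data)
    # Heuristic initialization: equal mixing parameters, means spread over
    # the sorted data, and the overall variance for every component.
    weights = [1.0 / k] * k
    means = [ordered[(2 * j + 1) * n_s // (2 * k)] for j in range(k)]
    overall_mean = sum(data) / n_s
    variances = [sum((z - overall_mean) ** 2 for z in data) / n_s] * k
    log_likelihoods = []
    for _ in range(iterations):
        # E step: ownership of every object by every component.
        ownerships = []
        log_likelihood = 0.0
        for z in data:
            joint = [w * gaussian_pdf(z, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            log_likelihood += math.log(total)
            ownerships.append([p / total for p in joint])
        log_likelihoods.append(log_likelihood)
        # M step: re-estimate mixing parameters, means and variances
        # as ownership-weighted averages.
        for j in range(k):
            n_j = sum(row[j] for row in ownerships)
            weights[j] = n_j / n_s
            means[j] = sum(row[j] * z for row, z in zip(ownerships, data)) / n_j
            variances[j] = sum(row[j] * (z - means[j]) ** 2 for row, z in zip(ownerships, data)) / n_j
    return weights, means, variances, log_likelihoods


rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(10.0, 1.0) for _ in range(100)]
weights, means, variances, log_likelihoods = em_mixture_1d(data, k=2)
print([round(w, 2) for w in weights], [round(m, 2) for m in means])
```

The list `log_likelihoods` makes the hill-climbing behaviour visible: the log-likelihood of the observed data does not decrease from one iteration to the next.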


Example 7.3 Classification of mechanical parts, EM algorithm for 
mixture of Gaussians 
Two results of the EM algorithm applied to the unlabelled data set of 
Figure 5.1(b) are shown in Figure 7.9. The algorithm is called with 
K = 4, which is the correct number of classes. Figure 7.9(a) is a 
correct result. The position and size of each component are appropriate. 
With random initializations of the algorithm, this result is reproduced 
with a probability of about 30%. Unfortunately, if the 
initialization is unlucky, the algorithm can also get stuck in a local 
optimum, as shown in Figure 7.9(b). Here, the components are at the 
wrong positions. 

The EM algorithm is implemented in PRTools by the function 
emclust. Listing 7.6 illustrates its use. 

Figure 7.9 Two results of the EM algorithm for mixture of Gaussians estimation 


Listing 7.6 
MATLAB code for calling the EM algorithm for mixture of Gaussians 
estimation. 


load nutsbolts_unlabeled;        % Load the data set z
z = setlabtype(z,'soft');        % Set probabilistic labels
[lab,w] = emclust(z,qdc,4);      % Cluster using EM
figure; clf; scatterd(z);
plotm(w,[],0.2:0.2:1);           % Plot results


7.2.4 Mixture of probabilistic PCA 


An interesting variant of the mixture of Gaussians is the mixture of 
probabilistic principal component analyzers (Tipping and Bishop, 
1999). Each single model is still a Gaussian as in (7.13), but its 
covariance matrix is constrained: 

$$C'_k = W_k W_k^T + \sigma_k^2 I \qquad (7.29)$$

where the $N \times D$ matrix $W_k$ has the D eigenvectors corresponding to the 
largest eigenvalues of $C_k$ as its columns (in the maximum likelihood solution 
of Tipping and Bishop, the m-th column is scaled by $\sqrt{\lambda_m - \sigma_k^2}$), and the noise level outside the 
subspace spanned by $W_k$ is estimated using the remaining eigenvalues $\lambda_m$ of $C_k$: 

$$\sigma_k^2 = \frac{1}{N - D} \sum_{m=D+1}^{N} \lambda_m \qquad (7.30)$$

The EM algorithm to fit a mixture of probabilistic principal component 
analyzers proceeds just as for a mixture of Gaussians, using $C'_k$ instead of 
$C_k$. At the end of the M step, the parameters $W_k$ and $\sigma_k^2$ are re-estimated 
for each cluster k by applying normal PCA to $C_k$ and (7.30), respectively. 

The mixture of probabilistic principal component analyzers introduces 
a new parameter, the subspace dimension D. Nevertheless, it uses 
far fewer parameters (when $D \ll N$) than the standard mixture of 
Gaussians. It can still model nonlinear data, which cannot be 
done using normal PCA. Finally, an advantage over normal PCA is that 
it is a full probabilistic model, i.e. it can be used directly as a density 
estimate. 
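
How the constraint reshapes a covariance matrix can be seen in a small sketch. It is plain Python and takes the eigenvalues and unit eigenvectors of the unconstrained covariance matrix as given, sorted by decreasing eigenvalue. It assumes the maximum likelihood scaling of Tipping and Bishop, in which the m-th retained eigenvector is weighted by the eigenvalue minus the noise variance; the function name `ppca_covariance` and the numbers are made up for this illustration.

```python
def ppca_covariance(eigenvalues, eigenvectors, d):
    """Constrained covariance for a D-dimensional probabilistic PCA model.

    eigenvalues  : eigenvalues of the unconstrained covariance, largest first
    eigenvectors : matching unit eigenvectors, one list per eigenvalue
    d            : subspace dimension D (must be smaller than N)
    """
    n = len(eigenvalues)
    # Noise variance: average of the N - D smallest eigenvalues.
    sigma2 = sum(eigenvalues[d:]) / (n - d)
    # Start from the isotropic noise term sigma^2 * I.
    covariance = [[sigma2 if r == c else 0.0 for c in range(n)] for r in range(n)]
    # Add the subspace term: each retained direction u contributes
    # (lambda_m - sigma^2) * u u^T (assumed maximum likelihood scaling).
    for m in range(d):
        scale = eigenvalues[m] - sigma2
        u = eigenvectors[m]
        for r in range(n):
            for c in range(n):
                covariance[r][c] += scale * u[r] * u[c]
    return covariance, sigma2


# Hypothetical covariance that is already diagonal, so that the
# eigenvectors are simply the coordinate axes.
eigenvalues = [4.0, 1.0, 0.5]
eigenvectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
covariance, sigma2 = ppca_covariance(eigenvalues, eigenvectors, d=1)
print(sigma2)       # 0.75
print(covariance)   # [[4.0, 0.0, 0.0], [0.0, 0.75, 0.0], [0.0, 0.0, 0.75]]
```

The variance along the retained direction is preserved, while the two remaining directions share one averaged noise variance; this is where the saving in parameters comes from.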

In PRTools, probabilistic PCA is implemented in qdc: an additional 
parameter specifies the number of subspace dimensions to use. To train a 
mixture, one can use emclust, as is illustrated in Listing 7.7. 


Listing 7.7 
MATLAB code for calling the EM algorithm for mixture of probabilistic 
principal component analyzers estimation. 


load nutsbolts_unlabeled;                % Load the data set z
z = setlabtype(z,'soft');                % Set probabilistic labels
[lab,w] = emclust(z,qdc([],[],[],1),4);  % Cluster 1D PCAs using EM
figure; clf; scatterd(z);
plotm(w,[],0.2:0.2:1);                   % Plot results


7.2.5  Self-organizing maps 


The self-organizing map (SOM; also known as self-organizing feature 
map, Kohonen map) is an unsupervised clustering and feature extraction 
method in which the cluster centres are constrained in their placing 
(Kohonen, 1995). The construction of the SOM is such that all objects 
in the input space retain as much as possible their distance and neigh- 
bourhood relations in the mapped space. In other words, the topology is 
preserved in the mapped space. The method is therefore strongly related 
to multi-dimensional scaling. 

The mapping is performed by a specific type of neural network, equipped 
with a special learning rule. Assume that we want to map an 
N-dimensional measurement space to a D-dimensional feature space, 
where D < N. In fact, often D = 1 or D = 2. In the feature space, we 
define a finite orthogonal grid with $M_1 \times M_2 \times \cdots \times M_D$ grid points. At 
each grid point we place a neuron. Each neuron stores an N-dimensional 
vector that serves as a cluster centre. By defining a grid for the neurons, each 
neuron does not only have a neighbouring neuron in the measurement 
space, it also has a neighbouring neuron in the grid. During the learning 
phase, neighbouring neurons in the grid are forced to also be neighbours 
in the measurement space. By doing so, the local topology will be preserved. 

The SOM is updated using an iterative update rule, which honours the 
topological constraints on the neurons. The iteration runs over all training 
objects $z_n$, $n = 1, \ldots, N_S$, and at each iteration the following steps are taken: 


Algorithm 7.5: Training a self-organizing map 


1. Initialization: Choose the grid size, $M_1, \ldots, M_D$, and initialize the 
weight vectors $w_j^{(0)}$ (where $j = 1, \ldots, M_1 \times \cdots \times M_D$) of each neuron, 
for instance by assigning the values of $M_1 \times \cdots \times M_D$ different, 
randomly selected samples from the data set. Set i = 0. 

2. Iterate: 

2.1 Find, for each object $z_n$ in the training set, the most similar 
neuron $w_k^{(i)}$: 

$$k(z_n) = \arg\min_j \| z_n - w_j^{(i)} \| \qquad (7.31)$$

This is called the best-matching or winning neuron for this input 
vector. 

2.2 Update the winning neuron and its neighbours using the update 
rule: 

$$w_j^{(i+1)} = w_j^{(i)} + \eta^{(i)} h^{(i)}(|k(z_n) - j|) \, (z_n - w_j^{(i)}) \qquad (7.32)$$

2.3 Repeat 2.1 and 2.2 for all samples $z_n$ in the data set. 
2.4 If the weights in the previous steps did not change significantly, 
then stop. Else, increment i and go to step 2.1. 


Here $\eta^{(i)}$ is the learning rate and $h^{(i)}(|k(z_n) - j|)$ is a weighting function. 
Both can depend on the iteration number i. This weighting function 
determines how much a particular neuron in the grid is updated. The term 
$|k(z_n) - j|$ indicates the distance between the winning neuron $k(z_n)$ and 
neuron j, measured over the grid. The winning neuron (for which 
$j = k(z_n)$) will get the maximal weight, because $h^{(i)}(\cdot)$ is chosen such that: 

$$h^{(i)}(\cdot) \le 1 \quad \text{and} \quad h^{(i)}(0) = 1 \qquad (7.33)$$

Thus, the winning neuron will get the largest update. This update moves 
the neuron in the direction of $z_n$ by the term $(z_n - w_j^{(i)})$. 

The other neurons in the grid will receive smaller updates. Since we 
want to preserve the neighbourhood relations only locally, the further 
the neuron is from the winning neuron, the smaller the update. 
A commonly used weighting function which satisfies these requirements 
is the Gaussian function: 

$$h^{(i)}(x) = \exp\left( -\frac{x^2}{\sigma_{(i)}^2} \right) \qquad (7.34)$$

For this function a suitable scale $\sigma_{(i)}$ over the map should be defined. 
This weighting function can be used for a grid of any dimension (not just 
one-dimensional), when we realize that $|k(z_n) - j|$ means in general the 
distance between the winning neuron $k(z_n)$ and neuron j over the grid. 

Let us clarify this with an example. Assume we start with data $z_n$ 
uniformly distributed in a square in a two-dimensional measurement 
space, and we want to map this data into a one-dimensional space. 
For this, K = 15 neurons are defined. These neurons are ordered, such 
that neuron j - 1 is the left neighbour of neuron j and neuron j + 1 is the 
right neighbour. In the weighting function $\sigma = 1$ is used, and in the update 
rule $\eta = 0.01$; these are not changed during the iterations. The neurons 
have to be placed as objects in the measurement space such that they represent 
the data as well as possible. Listing 7.8 shows an implementation for 
training a one-dimensional map in PRTools. 


Listing 7.8 
PRTools code for training and plotting a self-organizing map. 


z = rand(100,2);                         % Generate the data set z
w = som(z,15);                           % Train a 1D SOM and show it
figure; clf; scatterd(z); plotsom(w);
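
For readers who want to see the update rule at work without PRTools, the same experiment can be sketched in plain Python. The sketch assumes a Gaussian weighting function of the grid distance with a fixed width of 1 and a fixed learning rate of 0.01, as in the example; it runs a fixed number of passes over the data instead of testing for convergence. The function name `train_som_1d`, the seeds and the number of passes are assumptions made for this illustration.

```python
import math
import random


def train_som_1d(data, num_neurons=15, eta=0.01, sigma=1.0, passes=100, seed=0):
    """Train a one-dimensional SOM; returns the list of weight vectors."""
    rng = random.Random(seed)
    # Initialization: randomly selected, different samples from the data set.
    weights = [list(z) for z in rng.sample(data, num_neurons)]
    for _ in range(passes):
        for z in data:
            # Winning neuron: the weight vector closest to the object.
            winner = min(range(num_neurons), key=lambda j: math.dist(z, weights[j]))
            for j in range(num_neurons):
                # Gaussian weighting of the grid distance |winner - j|.
                h = math.exp(-((winner - j) ** 2) / sigma ** 2)
                # Move neuron j towards the object, scaled by eta * h.
                for d in range(len(z)):
                    weights[j][d] += eta * h * (z[d] - weights[j][d])
    return weights


rng = random.Random(42)
data = [(rng.random(), rng.random()) for _ in range(100)]   # uniform in the unit square
weights = train_som_1d(data)
# Length of the string of neurons, measured along the grid order.
string_length = sum(math.dist(weights[j], weights[j + 1]) for j in range(len(weights) - 1))
print(round(string_length, 2))
```

Plotting the weight vectors connected in grid order would give a picture comparable to the subplots of Figure 7.10. Because every update is a step from a weight vector towards a data object, the neurons never leave the region covered by the data.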


In Figure 7.10 four scatter plots of this data set with the SOM (K = 15) 
are shown. In the first subplot, the SOM is randomly initialized by 
picking K objects from the data set. The lines between the neurons 
indicate the neighbouring relationships between the neurons. Clearly, 
neighbouring neurons in the measurement space are not neighbouring in the grid. 
In the fourth subplot it is visible that after 100 iterations over the data 
set, the one-dimensional grid has organized itself over the square. This 
solution does not change in the next 500 iterations. 

With one exception, the neighbouring neurons in the measurement 
space are also neighbouring neurons in the grid. Only where the one-dimensional 
string crosses itself do neurons far apart in the grid become close 
neighbours in the measurement space. This local optimum, where the map did 
not unfold completely in the measurement space, is often encountered in 
SOMs. It is very hard to get out of this local optimum. The solution 
would be to restart the training with another random initialization. 

Many of the unfolding problems can be avoided, and the speed of convergence 
improved, by adjusting the learning parameter $\eta^{(i)}$ and the characteristic 
width in the weighting function $h^{(i)}(|k(z_n) - j|)$ during the iterations. 
Often, the following functional forms are used: 

$$\eta^{(i)} = \eta^{(0)} \exp(-i/\tau_1)$$
$$\sigma_{(i)} = \sigma_{(0)} \exp(-i/\tau_2) \qquad (7.35)$$

This introduces two additional scale parameters, $\tau_1$ and $\tau_2$, which have to be set by 
the user. 

Figure 7.10 The development of a one-dimensional self-organizing map, trained on 
a two-dimensional uniform distribution: (a) initialization; (b)-(d) after 10, 25 and 
100 iterations, respectively 
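
As a small illustration, exponentially decaying schedules for the learning rate and the neighbourhood width can be written as two functions of the iteration number. The initial values and time constants below are hypothetical; suitable settings depend on the grid size and the data.

```python
import math


def learning_rate(i, eta_0=0.1, tau_1=100.0):
    """Learning rate at iteration i: starts at eta_0, decays with time constant tau_1."""
    return eta_0 * math.exp(-i / tau_1)


def neighbourhood_width(i, sigma_0=4.0, tau_2=50.0):
    """Neighbourhood width at iteration i: starts at sigma_0, decays with time constant tau_2."""
    return sigma_0 * math.exp(-i / tau_2)


for i in (0, 50, 100, 200):
    print(i, round(learning_rate(i), 4), round(neighbourhood_width(i), 4))
```

A wide neighbourhood in the early iterations lets the whole string of neurons move together, which helps the map to unfold; the narrow neighbourhood and small learning rate in the later iterations allow local fine-tuning.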

Although the SOM offers a very flexible and powerful tool for mapping 
a data set to one or two dimensions, the user is required to make 
many important parameter choices: the dimension of the grid, the number 
of neurons in the grid, the shape and initial width of the neighbourhood 
function, the initial learning rate and the iteration dependencies of 
the neighbourhood function and learning rate. In many cases a two-dimensional 
grid is chosen for visualization purposes, but it might not fit 
the data well. The resulting data reduction will then not reflect the 
true structure in the data, and visual inspection only reveals an artificially 
induced structure. Training several SOMs with different settings might 
reveal which solutions are stable. 


Example 7.4 SOM of the RGB samples of an illuminated surface 

Figure 7.11(a) shows the RGB components of a colour image. The 
imaged object has a spherical surface that is illuminated by a spot 
illuminator. The light-material interaction involves two distinct 
physical mechanisms: diffuse reflection and specular (mirror-like) 
reflection. The RGB values that result from diffuse reflection are 
invariant to the geometry. Specular reflection only occurs at specific 
surface orientations determined by the position of the illuminator and 
the camera. Therefore, specular reflection is seen in the image as a 
glossy spot, a so-called highlight. The colour of the surface is determined 
by the spectral properties of its diffuse reflection component. 
Usually, specular reflection does not depend on the wavelength of the 
light, so that the colour of the highlight is solely determined by the 
illuminator (usually white light). 

Figure 7.11 A SOM that visualizes the effects of a highlight. (a) RGB image of an 
illuminated surface with a highlight (=glossy spot). (b) Scatter diagram of RGB 
samples together with a one-dimensional SOM 

Since light is additive, an RGB value z is observed as a linear 
combination of the diffuse component and the off-specular component: 
$z = \alpha z_{diff} + \beta z_{spec}$. The variables $\alpha$ and $\beta$ depend on the geometry. 
The estimation of $z_{diff}$ and $z_{spec}$ from a set of samples of z is an 
interesting problem. Knowledge of $z_{diff}$ and $z_{spec}$ would, for instance, 
open the door to ‘highlight removal’. 

Here, a one-dimensional SOM of the data is helpful to visualize the 
data. Figure 7.11(b) shows such a map. The figure suggests that the 
data forms a one-dimensional manifold, i.e. a curve in the three-dimensional 
RGB space. The manifold has the shape of an elbow. In 
fact, the orientation of the lower part of the elbow corresponds to 
data from an area with only diffuse reflection, i.e. to $\alpha z_{diff}$. The upper 
part corresponds to data from an area with both diffuse and specular 
reflection, i.e. to $\alpha z_{diff} + \beta z_{spec}$. 


7.2.6 Generative topographic mapping 


We conclude this chapter with a probabilistic version of the self-organizing 
map, the generative topographic mapping (or GTM) (Bishop 
et al., 1998). The goal of the GTM is the same as that of the SOM: to 
model the data by clusters with the constraint that neighbouring clusters 
in the original space are also neighbouring clusters in the mapped space. 
Contrary to the SOM, a GTM is fully probabilistic and it can be trained 
using the EM algorithm. 

The GTM starts from the idea that the density p(z) can be represented 
in terms of a D-dimensional latent variable q, where in general D < N. 
For this, a function $z = \phi(q; W)$ with weights W has to be found. The 
function maps a point q into a corresponding object $z = \phi(q; W)$. 
Because in reality the data will not be mapped perfectly, we assume 
some Gaussian noise with variance $\sigma^2$ in all directions. The full model 
for the probability of observing a vector z is: 

$$p(z \mid q, W, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left( -\frac{\| \phi(q; W) - z \|^2}{2\sigma^2} \right) \qquad (7.36)$$

In general, the distribution of z in the high dimensional space can be 
found by integration over the latent variable q: 

$$p(z \mid W, \sigma^2) = \int p(z \mid q, W, \sigma^2) \, p(q) \, dq \qquad (7.37)$$

In order to allow an analytical solution of this integral, a simple grid-like 
probability model is chosen for p(q), just like in the SOM: 

$$p(q) = \frac{1}{K} \sum_{k=1}^{K} \delta(q - q_k) \qquad (7.38)$$

i.e. a set of Dirac functions centred on grid nodes $q_k$. The log-likelihood 
of the complete model can then be written as: 

$$\ln L(W, \sigma^2) = \sum_{n=1}^{N_S} \ln\left( \frac{1}{K} \sum_{k=1}^{K} p(z_n \mid q_k, W, \sigma^2) \right) \qquad (7.39)$$


Still, the functional form for the mapping function ¢(q; W) has to be 
defined. This function maps the low dimensional grid to a manifold in 
the high dimensional space. Therefore, its form controls how nonlinear 
the manifold can become. In the GTM, a regression on a set of fixed 
basis functions is used: 


$$\phi(q; W) = W\gamma(q) \tag{7.40}$$


$\gamma(q)$ is a vector containing the output of $M$ basis functions, which are
usually chosen to be Gaussian with means on the grid points and a fixed
width $\sigma_\phi$. $W$ is an $N \times M$ weight matrix.

Given settings for $K$, $M$ and $\sigma_\phi$, the EM algorithm can be used to
estimate $W$ and $\sigma^2$. Let the complete data be $y_n^T = [z_n^T \;\; x_n^T]$, with $x_n$ the
hidden variables. $x_n$ is a $K$-dimensional vector. The element $x_{n,k}$ codes
the membership of $z_n$ with respect to the $k$-th cluster. After proper
initialization, the EM algorithm proceeds as follows:


Algorithm 7.6: EM algorithm for training a GTM: 


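1. E step: estimate the missing data:

$$\bar{x}_{n,k} = p(q_k \mid z_n, \hat{W}, \hat{\sigma}^2) = \frac{p(z_n \mid q_k, \hat{W}, \hat{\sigma}^2)}{\sum_{k'=1}^{K} p(z_n \mid q_{k'}, \hat{W}, \hat{\sigma}^2)} \tag{7.41}$$

using Bayes’ theorem.

2. M step: re-estimate the parameters:

$$\hat{W}^T = (\Gamma^T G \Gamma + \lambda I)^{-1} \Gamma^T \bar{X}^T Z \tag{7.42}$$

$$\hat{\sigma}^2 = \frac{1}{N_S N} \sum_{n=1}^{N_S} \sum_{k=1}^{K} \bar{x}_{n,k} \|\hat{W}\gamma(q_k) - z_n\|^2 \tag{7.43}$$

3. Repeat 1 and 2 until no significant changes occur.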


$G$ is a $K \times K$ diagonal matrix with $G_{kk} = \sum_{n=1}^{N_S} \bar{x}_{n,k}$ as elements. $\Gamma$ is a
$K \times M$ matrix containing the basis functions: $\Gamma_{k,m} = \gamma_m(q_k)$. $\bar{X}$ is an
$N_S \times K$ matrix with the $\bar{x}_{n,k}$ as elements. $Z$ is an $N_S \times N$ matrix. It
contains the data vectors $z_n^T$ as its rows. Finally, $\lambda$ is a regularization
parameter, which is needed in cases where $\Gamma^T G \Gamma$ becomes singular and
the inverse cannot be computed.
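
As an illustration of the E step (7.41) and the noise update (7.43), the following Python sketch computes the responsibilities and the re-estimated noise variance for a fixed set of mapped grid nodes. It is not PRTools code: the function names, and the assumption that the mapped nodes $W\gamma(q_k)$ are already available as a list `centres`, are introduced here for illustration only. The update of $W$ in (7.42) is not shown.

```python
import math


def gtm_e_step(samples, centres, sigma2):
    """Responsibilities p(q_k | z_n) for isotropic Gaussian noise, as in (7.41).

    samples : list of data vectors z_n
    centres : list of mapped grid nodes W*gamma(q_k) (assumed given)
    sigma2  : current noise variance
    """
    resp = []
    for z in samples:
        # Log of the unnormalized likelihoods; the common Gaussian
        # normalization factor cancels in Bayes' theorem.
        log_p = [-sum((c - v) ** 2 for c, v in zip(centre, z)) / (2.0 * sigma2)
                 for centre in centres]
        top = max(log_p)  # subtract the maximum for numerical stability
        weights = [math.exp(lp - top) for lp in log_p]
        total = sum(weights)
        resp.append([w / total for w in weights])
    return resp


def gtm_noise_variance(samples, centres, resp):
    """Re-estimated noise variance as in (7.43), for fixed mapped nodes."""
    n_s, n = len(samples), len(samples[0])
    acc = 0.0
    for z, row in zip(samples, resp):
        for centre, r in zip(centres, row):
            acc += r * sum((c - v) ** 2 for c, v in zip(centre, z))
    return acc / (n_s * n)
```

Each row of the returned responsibility matrix sums to one, so a sample lying exactly halfway between two nodes is shared equally between them.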

The GTM can be initialized by performing PCA on the data set,
projecting the data to $D$ dimensions and finding $W$ such that the
resulting latent variables $q_n$ approximate as well as possible the projected
data. The noise variance $\sigma^2$ can be found as the average of the $N - D$
smallest eigenvalues, as in probabilistic PCA.

Training the GTM is usually quick and converges well. However, it
suffers from the standard problem of EM algorithms in that it may
converge to a local optimum. Furthermore, its success depends highly
on the choices for $K$ (the number of grid points), $M$ (the number of basis
functions for the mapping), $\sigma_\phi$ (the width of these basis functions) and
$\lambda$ (the regularization term). For inadequate choices, the GTM may
reflect the structure in the data very poorly. The SOM has the same
problem, but needs to estimate somewhat fewer parameters.

An advantage of the GTM over the SOM is that the parameters that
need to be set by the user have a clear interpretation, unlike in the SOM
where unintuitive parameters such as learning parameters and time
parameters have to be defined. Furthermore, the end result is a probabilistic
model, which can easily be compared to other models (e.g. in terms of
likelihood) or combined with them.

Figure 7.12 shows some examples of GTMs ($D = 1$) trained on uniformly
distributed data. Figure 7.12(a) clearly shows how the GTM can
be overtrained if too many basis functions (here, 10) are used. In
Figure 7.12(b), fewer basis functions are used and the manifold found is
much smoother. Another option is to use regularization, which also
gives a smoother result as shown in Figure 7.12(c) but cannot
completely prevent extreme nonlinearities.

Listing 7.9 shows how a GTM can be trained and displayed in 
PRTools. 


Listing 7.9 
PRTools code for training and plotting generative topographic mapping. 


z=rand(100,2); % Generate the data set z 
w=gtm(z,15); % Train a 1D GTM and show it 
figure; clf; scatterd(z); plotgtm(w);









































Figure 7.12 Trained generative topographic mappings. (a) $K = 14$, $M = 10$, $\sigma_\phi = 0.2$
and $\lambda = 0$. (b) $K = 14$, $M = 5$, $\sigma_\phi = 0.2$ and $\lambda = 0$. (c) $K = 14$, $M = 10$, $\sigma_\phi = 0.2$ and
$\lambda = 0.01$
7.4 EXERCISES


1. Generate a two-dimensional data set z uniformly distributed between 0 and 1. Create a 
second data set y uniformly distributed between —1 and 2. Compute the (Euclidean) 
distances between z and y, and find the objects in y which have distance smaller than 1 
to an object in z. Make a scatter plot of these objects. Using a large number of objects 
in z, what should be the shape of the area of objects with distance smaller than 1? 
What would happen if you change the distance definition to the city-block distance 
(Minkowski metric with $q = 1$)? And what would happen if the cosine distance is
used? (0)

2. Create a data set z = gendatd(50,50,4,2);. Make a scatter plot of the data. Is the
data separable? Predict what would happen if the data is mapped to one dimension.
Check your prediction by mapping the data using pca(z,1), and training a simple
classifier on the mapped data (such as ldc). (0)

3. Load the worldcities data set and experiment with using different values for q in 
the MDS criterion function. What is the effect? Can you think of another way of 
treating close sample pairs different from far-away sample pairs? (0) 

4. Derive equations (7.21), (7.23) and (7.27). (*)

5. Discuss in which data sets it can be expected that the data is distributed in a subspace, 
or in clusters. In which cases will it not be useful to apply clustering or subspace 
methods? (*)

6. What is a desirable property of a clustering when the same algorithm is run multiple 
times on the same data set? Develop an algorithm that uses this notion to estimate the 
number of clusters present in the data. (**) 

7. In terms of scatter matrices (see the previous chapter), what does the K-means algo- 
rithm minimize? (0) 
8. Under which set of assumptions does the EM algorithm degenerate to the K-means
algorithm? (**)

9. What would be a simple way to lower the risk of ending up in a poor local minimum
with the K-means and EM algorithms? (0)

10. Which parameter(s) control(s) the generalization ability of the self-organizing map
(for example, its ability to predict the locations of previously unseen samples)? And
that of the generative topographic mapping? (*)


8 


State Estimation in Practice 


Chapter 4 discussed the theory needed for the design of a state estimator. 
The current chapter addresses the practical issues related to the design. 
Usually, the engineer cycles through a number of design stages of which 
some are depicted in Figure 8.1. 

One of the first steps in the design process is system identification. The 
purpose is to formulate a mathematical model of the system of interest. 
As stated in Chapter 4, the model is composed of two parts: the state 
space model of the physical process and the measurement model of the 
sensory system. Using these models, the theory from Chapter 4 provides 
us with the mathematical expressions of the optimal estimator. 

The next questions in the design process are the issues of observa- 
bility (can all states of the process be estimated from the given set of 
measurements?) and stability. If the system is not observable or not 
stable, either the model must be revised or the sensory system must be 
redesigned. 

If the design passes the observability and the stability tests, the
attention turns to the computational issues. Due to finite arithmetic
precision, there might be some pitfalls. Since in state estimation
the measurements are processed sequentially, the effects of round-off
errors may accumulate and may cause inaccurate results. The estimator
may even completely fail to work due to numerical instabilities.
Although the optimal solution of an estimation problem is often
unique, there are a number of different implementations which are
all mathematically equivalent, and thus all representing the same
solution, but with different sensitivities to round-off errors. Thus, in
this stage of the design process the appropriate implementation must
be selected.

Figure 8.1 Design stages for state estimators

As soon as the estimator has been realized, consistency checks must be 
performed to see whether the estimator behaves in accordance with the 
expectations. If these checks fail, it is necessary to return to an earlier 
stage, i.e. refinements of the models, selection of another implementa- 
tion, etc. 

Section 8.1 presents a short introduction to system identification.
The topic is a discipline on its own and will certainly not be covered
here in its full length. For a full treatment we refer to the pertinent
literature (Box and Jenkins, 1976; Eykhoff, 1974; Ljung and Glad,
1994; Ljung, 1999; Söderström and Stoica, 1989). Section 8.2 discusses
the observability and the dynamic stability of an estimator.
Section 8.3 deals with the computational issues. Here, several
implementations are given, each with its own sensitivity to numerical
instabilities. Section 8.4 shows how consistency checks can be
accomplished. Finally, Section 8.5 deals with extensions of the discrete
Kalman filter. These extensions make the estimator applicable to a wider
class of problems, i.e. non-white/cross-correlated noise sequences and
offline estimation.

Some aspects of state estimator design are not discussed in this book,
for instance sensitivity analysis and error budgets (Gelb et al., 1974).
These techniques are systematic methods for the identification of the most
vulnerable parts of the design.

Most topics in this chapter concern Kalman filtering as introduced in
Section 4.2.1, though some are also of relevance for extended Kalman
filtering. For the sake of convenience, the equations are repeated here.
The point of departure in Kalman filtering is a linear-Gaussian model of
the physical process:

$$\begin{aligned}
x(i+1) &= F(i)x(i) + L(i)u(i) + w(i), \quad i = 0, 1, \ldots && \text{(state equation)} \\
z(i) &= H(i)x(i) + v(i) && \text{(measurement model)}
\end{aligned} \tag{8.1}$$


$x(i)$ is the state vector with dimension $M$. $z(i)$ is the measurement vector
with dimension $N$. The process noise $w(i)$ and measurement noise $v(i)$
are white Gaussian noise sequences, zero mean, and with covariance
matrices $C_w(i)$ and $C_v(i)$, respectively. Process noise and measurement
noise are uncorrelated: $C_{wv}(i) = 0$. The prior knowledge is that $x(0)$
has a Gaussian distribution with expectation $E[x(0)]$ and covariance
matrix $C_x(0)$.

The MMSE solution to the online estimation problem is developed in 
Section 4.2.1, and is known as the discrete Kalman filter. The solution is 
an iterative scheme. Each iteration cycles through (4.27) and (4.28), 
which are repeated here for convenience: 


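$$\begin{aligned}
&\text{update:} \\
&\quad \hat{z}(i) = H(i)\bar{x}(i|i-1) && \text{(predicted measurement)} \\
&\quad S(i) = H(i)C(i|i-1)H^T(i) + C_v(i) && \text{(innovation matrix)} \\
&\quad K(i) = C(i|i-1)H^T(i)S^{-1}(i) && \text{(Kalman gain matrix)} \\
&\quad \bar{x}(i|i) = \bar{x}(i|i-1) + K(i)(z(i) - \hat{z}(i)) && \text{(updated estimate)} \\
&\quad C(i|i) = C(i|i-1) - K(i)S(i)K^T(i) && \text{(error covariance matrix)} \\
&\text{prediction:} \\
&\quad \bar{x}(i+1|i) = F(i)\bar{x}(i|i) + L(i)u(i) && \text{(prediction)} \\
&\quad C(i+1|i) = F(i)C(i|i)F^T(i) + C_w(i) && \text{(predicted state covariance)}
\end{aligned} \tag{8.2}$$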


The iterative procedure is initiated with the prediction for i = 0 set equal 
to the prior: 


$\bar{x}(0|-1) \overset{\text{def}}{=} E[x(0)]$ and $C(0|-1) \overset{\text{def}}{=} C_x(0)$.
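
To make the iteration concrete, the sketch below runs the update and prediction steps of (8.2) for a scalar, time-invariant system, so that all matrices reduce to plain numbers. It is a minimal Python illustration under these assumptions, not the MATLAB implementation used elsewhere in this book; the function name and argument names are chosen here for illustration.

```python
def kalman_scalar(zs, F, H, Cw, Cv, x0, C0, L=0.0, us=None):
    """Scalar discrete Kalman filter following the order of (8.2).

    zs     : measurements z(0), z(1), ...
    F, H   : state transition and measurement coefficients
    Cw, Cv : process and measurement noise variances
    x0, C0 : prior expectation E[x(0)] and prior variance C_x(0)
    L, us  : input gain and optional input sequence u(i)
    Returns a list of (updated estimate, error variance) pairs.
    """
    x_pred, C_pred = x0, C0  # prediction for i = 0 equals the prior
    results = []
    for i, z in enumerate(zs):
        # update
        z_hat = H * x_pred                    # predicted measurement
        S = H * C_pred * H + Cv               # innovation variance
        K = C_pred * H / S                    # Kalman gain
        x_upd = x_pred + K * (z - z_hat)      # updated estimate
        C_upd = C_pred - K * S * K            # error variance
        results.append((x_upd, C_upd))
        # prediction
        u = us[i] if us is not None else 0.0
        x_pred = F * x_upd + L * u
        C_pred = F * C_upd * F + Cw
    return results
```

For a constant state ($F = 1$, $C_w = 0$) observed directly with unit noise variance and a very uncertain prior, the error variance after $n$ measurements approaches $1/n$, which is the variance of a simple average.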
8.1 SYSTEM IDENTIFICATION


System identification is the act of formulating a mathematical model of a 
given dynamic system based on input and output measurements of that 
system, and on general knowledge of the physical process at hand. The 
discipline of system identification not only finds application in estimator 
design, but also monitoring, fault detection and diagnosis (e.g. for 
maintenance of machines) and design of control systems. 

A dichotomy of models exists between parametric models and non- 
parametric models. The nonparametric models describe the system by 
means of tabulated data of, for instance, the Fourier transfer function(s) 
or the edge response(s). Various types of parametric models exist, e.g. 
state space models, poles-zeros models and so on. In our case, state space 
models are the most useful, but other parametric models can also be used 
since most of these models can be converted to a state space. 

The identification process can roughly be broken down into four parts:
structuring, experiment design, parameter estimation, and evaluation and
model selection.


8.1.1 Structuring 


The first activity is structuring. The structure of the model is settled by 
addressing the following questions. What is considered part of the system 
and what is environment? What are the (controllable) input variables? What 
are the possible disturbances (process noise)? What are the state variables? 
What are the output variables? What are the physical laws that relate the 
physical variables? Which parameters of these laws are known, and which 
are unknown? Which of these parameters can be measured directly? 

Usually, there is not a unique answer to all these questions. In fact, the 
result of structuring is a set of candidate models. 


Example 8.1 Candidate models describing a simple hydraulic system 
The hydraulic system depicted in Figure 8.2 consists of two identical
tanks connected by a pipeline with flow $q_1(t)$. The input flow $q_0(t)$ is
acting on the first tank. $q_2(t)$ is the output flow from the second tank.

The relation between the level and the net input flow of a tank is
$C\,\partial h = q\,\partial t$. $C$ is the capacity of the tank. If the horizontal cross-sections
of the tanks are constant, the capacity does not depend on $h$. In the
present example, the capacity of both tanks is $C = 420$ (cm$^2$). The order
of the system is at least two; the two states being the levels $h_1(t)$ and $h_2(t)$.
Figure 8.2 A simple hydraulic system consisting of two connected tanks
(diagram labels: input flow $q_0$, level difference $\Delta h = h_1 - h_2$, tank 1, tank 2, drain)


For the model of the flow through the pipelines we consider three 
possibilities, each leading to a different structure of the model. 


Candidate model I: Frictionless liquids; Torricelli’s law 

For a frictionless liquid, Torricelli’s law states that when a tank leaks,
the sum of potential and kinetic energy is constant: $q^2 = 2A^2gh$. $A$ is
the area of the hole. Application of this law gives rise to the following
second order, nonlinear model ($g$ is the gravitational constant):

$$\begin{aligned}
\dot{h}_1 &= -\frac{1}{C}\sqrt{2A_1^2 g (h_1 - h_2)} + \frac{1}{C} q_0 \\
\dot{h}_2 &= \frac{1}{C}\sqrt{2A_1^2 g (h_1 - h_2)} - \frac{1}{C}\sqrt{2A_2^2 g h_2}
\end{aligned} \tag{8.3}$$
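
The behaviour of candidate model I can be explored numerically. The following Python sketch integrates (8.3) with a simple forward Euler scheme. The hole areas $A_1 = A_2 = 1$ (cm$^2$), the step size and the value $g = 981$ (cm/s$^2$) are assumptions made for this illustration; only $C = 420$ (cm$^2$) is taken from the example. The model is only defined for $h_1 \geq h_2 \geq 0$, so the sketch clips negative arguments of the square roots to zero.

```python
import math


def simulate_torricelli(h1, h2, A1, A2, C=420.0, g=981.0, q0=0.0,
                        dt=0.1, steps=600):
    """Forward Euler integration of the Torricelli model (8.3).

    h1, h2 : initial levels (cm); A1, A2 : hole areas (cm^2, assumed)
    C      : tank capacity (cm^2); g : gravitational constant (cm/s^2)
    q0     : input flow (cm^3/s); dt : step size (s)
    Returns the list of (h1, h2) pairs, including the initial condition.
    """
    levels = [(h1, h2)]
    for _ in range(steps):
        q1 = math.sqrt(2.0 * A1 ** 2 * g * max(h1 - h2, 0.0))  # tank 1 -> tank 2
        q2 = math.sqrt(2.0 * A2 ** 2 * g * max(h2, 0.0))       # tank 2 -> drain
        h1 = max(h1 + dt * (q0 - q1) / C, 0.0)
        h2 = max(h2 + dt * (q1 - q2) / C, 0.0)
        levels.append((h1, h2))
    return levels
```

Starting from two equally filled tanks with zero input, the second tank starts to drain first; the resulting level difference then drives the flow between the tanks.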


Candidate model II: Linear friction 

Here, we assume that the difference of pressure on both sides of a
pipeline holds a linear relation with the flow: $\Delta p = Rq$. The
parameter $R$ is the resistance. Since $\Delta p = \rho g \Delta h$ ($\rho$ is the mass density), the
assumption brings the following linear, second order model:

$$\begin{aligned}
\dot{h}_1 &= -\frac{\rho g}{R_1 C}(h_1 - h_2) + \frac{1}{C} q_0 \\
\dot{h}_2 &= \frac{\rho g}{R_1 C}(h_1 - h_2) - \frac{\rho g}{R_2 C} h_2
\end{aligned} \tag{8.4}$$

Candidate model III: Linear friction and hydraulic inertness 

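A liquid within a pipeline with a length $\ell$ and cross-section $A$
experiences a force $\rho\ell\dot{q}$ (second law of Newton: $F = ma$). This force is
induced by the difference of pressure: $F = A\Delta p = A\rho g\Delta h$. Thus,
$\ell\dot{q} = Ag\Delta h$. With friction, the equation is $ARq + \rho\ell\dot{q} = A\rho g\Delta h$. With
that, we arrive at the following linear, fourth order model:

$$\begin{bmatrix} \dot{h}_1 \\ \dot{h}_2 \\ \dot{q}_1 \\ \dot{q}_2 \end{bmatrix} =
\begin{bmatrix} 0 & 0 & -1/C & 0 \\ 0 & 0 & 1/C & -1/C \\ m_1 & -m_1 & -r_1 & 0 \\ 0 & m_2 & 0 & -r_2 \end{bmatrix}
\begin{bmatrix} h_1 \\ h_2 \\ q_1 \\ q_2 \end{bmatrix} +
\begin{bmatrix} 1/C \\ 0 \\ 0 \\ 0 \end{bmatrix} q_0 \tag{8.5}$$

where
$m_1 = gA_1/\ell_1$, $m_2 = gA_2/\ell_2$, $r_1 = R_1A_1/(\ell_1\rho)$ and $r_2 = R_2A_2/(\ell_2\rho)$.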


8.1.2 Experiment design 


The purpose of the experimentation is to enable the determination of the 
unknown parameters and the evaluation and selection of the candidate 
models. The experiment comes down to the enforcement of some input 
test signals to the system at hand and the acquisition of measured data. 
The design aspects of the experiments are the choice of input signals, the 
choice of the sensors and the sensor locations and the preprocessing of 
the data. 

The input signals should be such that the relevant aspects of the 
system at hand are sufficiently covered. The bandwidth of the input sig- 
nal must match the bandwidth of interest. The signal magnitude must be 
in a range as required by the application. Furthermore, the signal energy 
should be sufficiently large so as to achieve suitable signal-to-noise 
ratios at the output of the sensors. Here, a trade-off exists between the 
bandwidth, the registration interval and the signal magnitude. 

For instance, a very short input pulse covers a large bandwidth and
permits a low registration interval, but requires a tremendous (perhaps
excessive) magnitude in order to have sufficient signal-to-noise ratios. At
the other extreme, a single sinusoidal burst with a long
duration covers only a very narrow bandwidth, and the signal magnitude
can be kept low while still offering enough signal energy. An often-used
signal is the pseudorandom binary signal.

The choice of sensors is another issue. First, the set of identifiable,
unknown parameters of the model must be determined. For instance, if
in equation (8.4) $\rho$ and $R_1$ are unknown, then these parameters are not
separately identifiable from the level measurements because these parameters
always occur in the combination $\rho/R_1$. Thus, in this case we can treat $\rho/R_1$ as
one identifiable parameter. Sometimes, it might be necessary to temporarily
use additional (often expensive) sensors so as to enable the estimation of all
relevant parameters. As soon as the system identification is satisfactorily
accomplished, these additional sensors can be removed from the system.

Often, the acquired data need preprocessing before the parameter
estimation and evaluation take place. Reasons for doing so are, for instance:


• If the bandwidth of the noise is larger than the bandwidth of
interest, filtering can be applied to suppress the noise in the
unimportant frequency ranges.

• If a linearized model is sought, the unimportant offsets should
be removed (offset correction, baseline removal). Often, this is done
by subtraction of the average from the signal.

• Sudden peaks (spikes) in the data are probably caused by
disturbances such as mechanical shocks and electrical interference due to
insufficient shielding. These peaks should be removed.


In order to prevent overfitting, it might be useful to split the data 
according to two time intervals. The data in the first interval is used 
for parameter estimation. The second interval is used for model evalu- 
ation. Cross-evaluation might also be useful. 
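
The preprocessing steps and the split into an estimation and an evaluation interval can be sketched in a few lines of Python. The three-sample median test for spikes, the threshold argument and the function names are assumptions chosen for this illustration; other despiking and filtering methods may suit a given application better.

```python
from statistics import mean, median


def remove_spikes(signal, threshold):
    """Replace isolated spikes by the median of their 3-sample neighbourhood.

    A sample counts as a spike if it deviates more than `threshold`
    from the local median (an assumed, simple criterion).
    """
    cleaned = list(signal)
    for i in range(1, len(signal) - 1):
        local = median(signal[i - 1:i + 2])
        if abs(signal[i] - local) > threshold:
            cleaned[i] = local
    return cleaned


def remove_offset(signal):
    """Offset correction: subtract the average from the signal."""
    avg = mean(signal)
    return [s - avg for s in signal]


def split_in_time(signal, fraction=0.5):
    """Split into a first interval (estimation) and a second (evaluation)."""
    cut = int(len(signal) * fraction)
    return signal[:cut], signal[cut:]
```

Spikes are removed before the offset correction so that a single outlier does not bias the estimated average.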


Example 8.2 Experimental data from the hydraulic system 

Figure 8.3 shows data obtained from the hydraulic system depicted in
Figure 8.2. The data is obtained using two level sensors that measure
the levels $h_1$ and $h_2$. The sample period is $\Delta = 5$ (s). The standard
deviation of the sensor noise is about $\sigma_v = 0.04$ (cm).

The measured levels in Figure 8.3 correspond to the free response of
the system obtained with zero input and with an initial condition in
which both tanks are completely filled, i.e. $h_1(0) = h_2(0) = 25$ (cm).
Such an experiment is useful if it is envisaged that in the application
this kind of level swing can occur.


8.1.3 Parameter estimation 


Suppose that all unknown parameters are gathered in one parameter
vector $\alpha$. The discrete system equation is denoted by $f(x, w, \alpha)$. The
measurement system is modelled, as before, by $z = h(x, v)$. We then have
the sequence of measurements according to the following recursions:

$$\begin{aligned}
z(i) &= h(x(i), v(i)) \\
x(i+1) &= f(x(i), w(i), \alpha)
\end{aligned} \qquad \text{for } i = 0, 1, \ldots, I-1 \tag{8.6}$$

$I$ is the length of the sequence. $x(0)$ is the initial condition (which may be
known or unknown). $v(i)$ and $w(i)$ are the measurement noise and the
process noise, respectively.

Figure 8.3 Experimental data obtained from the hydraulic system

One possibility for estimating $\alpha$ is to process the sequence $z(i)$ in batches.
For that purpose, we stack all measurement vectors to one $I \times N$
dimensional vector, say $Z$. Equation (8.6) defines the conditional probability
density $p(Z|\alpha)$. The stochastic nature of $Z$ is due to the randomness of
$w(i)$, $v(i)$ and possibly $x(0)$. Equation (8.6) shows how this randomness
propagates to $Z$. Once the conditional density $p(Z|\alpha)$ has been settled, the
complete estimation machinery from Chapter 3 applies, thus providing the
optimal solution of $\alpha$. Maximum likelihood estimation is especially
popular since (8.6) can be used to calculate the (log-)likelihood of $\alpha$.
A numerical optimization procedure must provide the solution.

Working in batches soon becomes complicated due to the (often)
nonlinear nature of the relations involved in (8.6). Many alternative
techniques have been developed. For linear systems, processing in the
frequency domain may be advantageous. Another possibility is to
process the measurements sequentially. The trick is to regard the parameters
as state vectors $\alpha(i)$. Static parameters do not change in time. So, the
corresponding state equation is $\alpha(i+1) = \alpha(i)$. Sometimes, it is useful to
allow slow variations in $\alpha(i)$. This is helpful in order to model drift
phenomena, but also to improve the convergence properties of the
procedure. A simple model would be a process that is similar to random
walk (Section 4.2.1): $\alpha(i+1) = \alpha(i) + \omega(i)$. The white noise sequence
$\omega(i)$ is the driving force for the changes. Its covariance matrix $C_\omega$ should
be small in order to prevent too wild a behaviour of $\alpha(i)$.
Using this model, equation (8.6) transforms into:





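$$\begin{aligned}
\begin{bmatrix} x(i+1) \\ \alpha(i+1) \end{bmatrix} &= \begin{bmatrix} f(x(i), w(i), \alpha(i)) \\ \alpha(i) + \omega(i) \end{bmatrix} && \text{(state equation)} \\
z(i) &= h(x(i), v(i)) && \text{(measurement equation)}
\end{aligned} \tag{8.7}$$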


The original state vector $x(i)$ has been augmented with $\alpha(i)$. The new
state equation can be written as $\xi(i+1) = \varphi(\xi(i), w(i), \omega(i))$ with
$\xi(i) = [x^T(i) \;\; \alpha^T(i)]^T$. This opens the door to simultaneous online
estimation of both $x(i)$ and $\alpha(i)$ using the techniques discussed in
Chapter 4. However, the new state function $\varphi(\cdot)$ is nonlinear and for
online estimation we must resort to estimators that can handle these
nonlinearities, e.g. extended Kalman filtering (Section 4.2.2), or particle
filtering (Section 4.4).

Note that if $\omega(i) = 0$, then $\alpha(i)$ is a random constant and (hopefully) its
estimate $\hat{\alpha}(i)$ converges to a constant. If we allow $\omega(i)$ to deviate from zero
by setting $C_\omega$ to some (small) nonzero diagonal matrix, the estimator
becomes adaptive. It has the potential to keep track of parameters that drift.


Example 8.3 Parameter estimation for the hydraulic system 

In order to estimate the parameters of the three models of the
hydraulic system using the data from the previous example, the
following procedure was applied. First, a particle filter was executed
in order to get a rough indication of the magnitudes of the parameters.
Figure 8.4(a) shows the results of the filter applied to the Torricelli
model. In this model, there are two parameters $A_1$ and $A_2$ which were
both initiated with a uniform distribution between 0 and 2 (cm$^2$). The
parameters were modelled with $A(i+1) = A(i) + \omega(i)$ where $\omega(i)$ is
white noise with a standard deviation of 0.004 (cm$^2$).
Figure 8.4 Results of the estimation of the parameters of the hydraulic models. The
dotted lines are the measurements. The solid lines are results from the model 


The second stage of the estimation procedure is a refinement of the
parameters based on maximum likelihood estimation. Equation (8.6)
was used to numerically evaluate the log-likelihood as a function of
the parameters. A normal distribution of $v(i)$ was assumed. Therefore,
instead of the log-likelihood we can equivalently calculate the
sum of squared Mahalanobis distances:

$$\begin{aligned}
& J(\hat{x}(0), \alpha) = \sum_{i=0}^{I} (z(i) - \hat{x}(i))^T C_v^{-1} (z(i) - \hat{x}(i)) \\
& \text{with: } \hat{x}(i+1) = f(\hat{x}(i), \alpha) \quad \text{for } i = 0, 1, \ldots, I-1
\end{aligned} \tag{8.8}$$
The minimization of $J(\hat{x}(0), \alpha)$ using the MATLAB function
fminsearch from the Optimization Toolbox gives the final result.
The numerical optimization is initiated with the parameters obtained
from particle filtering.

Figures 8.4(b), (c) and (d) show the estimated levels $\hat{x}(i)$ obtained
with minimal $J(\cdot)$ for the three considered models.
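
The principle of this second stage can be illustrated on a toy problem. The Python sketch below evaluates a cost of the form (8.8) for a hypothetical scalar model $\hat{x}(i+1) = a\hat{x}(i)$ that is measured directly, and selects the parameter $a$ that minimizes the cost. The toy model is not one of the hydraulic candidate models, and a plain grid search is swapped in for the simplex search of fminsearch; both choices are assumptions made to keep the example self-contained.

```python
def simulate(x0, a, n):
    """Noise-free response of the toy model x(i+1) = a*x(i), n samples."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(a * xs[-1])
    return xs


def cost(zs, x0, a, var_v):
    """Sum of squared Mahalanobis distances between data and model output."""
    xs = simulate(x0, a, len(zs))
    return sum((z - x) ** 2 / var_v for z, x in zip(zs, xs))


def grid_search(zs, x0, candidates, var_v):
    """Return the candidate parameter with the smallest cost."""
    return min(candidates, key=lambda a: cost(zs, x0, a, var_v))
```

With noise-free data generated by the model itself, the cost at the true parameter is exactly zero, so the search recovers it whenever it is among the candidates. With noisy data the minimum cost is no longer zero.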


8.1.4 Evaluation and model selection 


The last step is the evaluation of the candidate models and the final 
selection. For that purpose, we select a quality measure and evaluate and 
compare the various models. Popular quality measures are the log-like- 
lihood and the sum of squares of the residuals (the difference between 
measurements and model-predicted measurements). 

Models with a low order are preferable because the risk of over- 
fitting is minimized and the estimators are less sensitive to estima- 
tion errors of the parameters. Therefore, it might be beneficial to 
consider not only the model with the best quality measure, but also 
the models which score slightly less, but with a lower order. In order 
to evaluate these models, other tests are useful. Ideally, the cross 
correlation between the residuals and the input signal is zero. If not, 
there is still part of the behaviour of the system that is not explained 
by the model. 


Example 8.4 Model selection for the hydraulic system 

Qualitative evaluation of the three models using the log-likelihood as 
a quality measure (or equivalently, the sum of squared Mahalanobis 
distances) yields the following results: 


Torricelli’s model: J = 4875 
Second order linear model: J = 281140 
Fourth order linear model: J = 62595 


Here, there is a wide gap between the nonlinear Torricelli’s model and
the two linear models. Yet, the quality measure $J = 4875$ is still too
large to ascribe it fully to the measurement noise. Without the
modelling errors, the mean value of the sum of squared Mahalanobis
distances ($I = 300$, $N = 2$) is 600. Inspection of Figure 8.4(b) reveals
that the feet of the curves cause the discrepancy. Perhaps the Torricelli
model should be extended with a small linear friction term.
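
The reference value of 600 can be verified with a short calculation. Assuming a perfect model and known parameters, each residual equals the measurement noise $v(i)$, which has zero mean and covariance matrix $C_v$. Each term of the sum then has expectation $N$:

$$E\left[v^T(i) C_v^{-1} v(i)\right] = \operatorname{trace}\left(C_v^{-1} E\left[v(i)v^T(i)\right]\right) = \operatorname{trace}\left(C_v^{-1} C_v\right) = N$$

Summing over $I = 300$ measurement vectors of dimension $N = 2$ gives $E[J] = I \cdot N = 300 \times 2 = 600$. This is only an approximate reference: fitting the parameters to the same data lowers the expected value slightly.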
8.1.5 Identification of linear systems with a random input


There is a rich literature devoted to the problem of explaining a random
sequence $x(i)$ by means of a linear system driven by white noise (Box and
Jenkins, 1976). An example of such a model is the autoregressive model
introduced in Section 4.2.1. The $M$th order AR model is:

$$x(i) = \sum_{n=1}^{M} a_n x(i-n) + w(i) \tag{8.9}$$


This type of model is easily cast into a state space model. As such it can 
be used to describe non-white process noise. More general schemes are 
the autoregressive moving average (ARMA) models and the autoregres- 
sive integrating moving average (ARIMA) models. The discussion here is 
only introductory and is restricted to AR models. For a full treatment we 
refer to the pertinent literature. 

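The identification of an AR model from an observed sequence $x(i)$
boils down to the determination of the order $M$, and the estimation of
the parameters $a_n$ and $\sigma_w^2$. Assuming that the system is in the steady
state, the estimation can be done by solving the Yule-Walker equations.
These equations arise if we multiply (8.9) on both sides by
$x(i-1), \ldots, x(i-M)$, and take expectations:

$$E[x(i)x(i-k)] = \sum_{n=1}^{M} a_n E[x(i-n)x(i-k)] + E[w(i)x(i-k)] \quad \text{for } k = 1, \ldots, M \tag{8.10}$$

Since $E[x(i-k)w(i)] = 0$ and $r_k \overset{\text{def}}{=} E[x(i)x(i-k)]/\sigma_x^2$, equation (8.10)
defines the following system of linear relations (see also equation
(4.21)):

$$\begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_M \end{bmatrix} =
\begin{bmatrix} 1 & r_1 & r_2 & \cdots & r_{M-1} \\ r_1 & 1 & r_1 & \cdots & r_{M-2} \\ \vdots & & \ddots & & \vdots \\ r_{M-1} & r_{M-2} & \cdots & r_1 & 1 \end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_M \end{bmatrix} \tag{8.11}$$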
The parameters $a_n$ are found by estimating the correlation coefficients $r_k$
and solving (8.11).

The parameter $\sigma_w^2$ is obtained by multiplying (8.9) by $x(i)$ and taking
expectations:

$$\sigma_x^2 = \sigma_x^2 \sum_{n=1}^{M} a_n r_n + \sigma_w^2 \tag{8.12}$$

Estimation of $\sigma_x^2$ and solving (8.12) gives us the estimate of $\sigma_w^2$.

The order of the system can be retrieved by a concept called the
partial autocorrelation function. Suppose that an AR sequence $x(i)$ has
been observed with unknown order $M$. The procedure for the
identification of this sequence is to first estimate the correlation coefficients $r_k$,
yielding estimates $\hat{r}_k$. Then, for a number of hypothesized orders
$\hat{M} = 1, 2, 3, \ldots$ we estimate the AR coefficients $\hat{a}_{k,\hat{M}}$ for $k = 1, \ldots, \hat{M}$
(the subscript $\hat{M}$ has been added to discriminate between coefficients of
different orders). From these coefficients, the last one of each sequence,
i.e. $\hat{a}_{\hat{M},\hat{M}}$, is called the partial autocorrelation function. It can be
proven that:

$$a_{\hat{M},\hat{M}} = 0 \quad \text{for } \hat{M} > M \tag{8.13}$$

Thus, the order $M$ is determined by checking where $\hat{a}_{\hat{M},\hat{M}}$ drops down to
near zero.
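
The procedure can be sketched compactly with the Levinson-Durbin recursion, which solves the Yule-Walker system (8.11) for successive hypothesized orders and delivers the partial autocorrelation function as a by-product. The Python function below is an illustrative sketch, not the aryule implementation; its name and interface are assumptions. It expects correlation coefficients with $r_0 = 1$.

```python
def yule_walker_pacf(r, max_order):
    """Solve (8.11) for orders 1..max_order by Levinson-Durbin recursion.

    r : correlation coefficients, r[0] = 1, r[k] = r_k
    Returns (coeffs, pacf): coeffs[m] holds the AR coefficients for
    hypothesized order m, and pacf[m - 1] is the last coefficient a_{m,m}.
    """
    coeffs, pacf = {}, []
    a_prev = []   # coefficients of the previous order
    err = 1.0     # normalized prediction error variance sigma_w^2 / sigma_x^2
    for m in range(1, max_order + 1):
        acc = r[m] - sum(a_prev[n] * r[m - 1 - n] for n in range(m - 1))
        k = acc / err                         # a_{m,m}
        a = [a_prev[n] - k * a_prev[m - 2 - n] for n in range(m - 1)] + [k]
        err *= 1.0 - k * k
        coeffs[m] = a
        pacf.append(k)
        a_prev = a
    return coeffs, pacf
```

For the exact correlation coefficients of a first order process, $r_k = a^k$, the partial autocorrelation function equals $a$ at order one and vanishes for all higher orders, in agreement with (8.13). With estimated coefficients $\hat{r}_k$ the higher-order values are only near zero.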


Example 8.5 AR model of a pseudorandom binary sequence 
Figure 8.5(a) shows a realization of a zero mean, pseudorandom 
binary sequence. A discrete Markov model, given in terms of transi- 
tion probabilities (see Section 4.3.1), would be an appropriate model 
for this type of signal. However, sometimes it is useful to describe the 
sequence with a linear, AR model. This occurs, for instance, when the 
sequence is an observation of process noise in an (otherwise) linear 
plant. The application of a Kalman filter requires the availability of a 
linear model, and thus the process noise must be modelled as an AR 
process. 

Figure 8.5(b) shows the partial autocorrelation function obtained
from a registration of $x(i)$ consisting of 4000 samples (Figure 8.5(a)
only shows the first 500 samples). The plot has been made using
MATLAB’s function aryule from the Signal Processing Toolbox.
Clearly, the partial autocorrelation function drops down at $\hat{M} = 2$,
so the estimated order is $M = 1$, i.e. the best AR model is of first
order. Figure 8.5(c) shows a realization of such a process.

Figure 8.5 Modelling a pseudorandom binary signal by an AR process


8.2 OBSERVABILITY, CONTROLLABILITY AND 
STABILITY 


8.2.1 Observability 


We consider a deterministic linear system:

$$\begin{aligned}
x(i+1) &= F(i)x(i) + L(i)u(i) \\
z(i) &= H(i)x(i)
\end{aligned} \tag{8.14}$$

The system is called observable if with known $F(i)$, $L(i)u(i)$ and $H(i)$ the
state $x(i)$ (with fixed $i$) can be solved from a sequence $z(i), z(i+1), \ldots$ of
measurements. The system is called completely observable if it is
observable for any $i$. In the following, we assume $L(i)u(i) = 0$. This is without
any loss of generality since the influence on $z(i), z(i+1), \ldots$ of an $L(i)u(i)$
not being zero can be neutralized easily. Hence, the observability of a
system solely depends on $F(i)$ and $H(i)$.
An approach to find out whether the system is observable is to
construct the observability Gramian (Bar-Shalom and Li, 1993). From (8.14):

$$\begin{bmatrix} z(i) \\ z(i+1) \\ z(i+2) \\ \vdots \\ z(i+n) \end{bmatrix} =
\begin{bmatrix} H(i)x(i) \\ H(i+1)x(i+1) \\ H(i+2)x(i+2) \\ \vdots \\ H(i+n)x(i+n) \end{bmatrix} =
\begin{bmatrix} H(i) \\ H(i+1)F(i) \\ H(i+2)F(i+1)F(i) \\ \vdots \\ H(i+n)F(i+n-1)\cdots F(i) \end{bmatrix} x(i) \tag{8.15}$$

Equation (8.15) is of the type $z = Hx$. The least squares estimate is
$\hat{x} = (H^TH)^{-1}H^Tz$. See Section 3.3.1. The solution exists if and only if
the inverse of $H^TH$ exists, or in other words, if the rank of $H^TH$ is equal
to the dimension $M$ of the state vector. Equivalent conditions are that
$H^TH$ is positive definite (i.e. $y^TH^THy > 0$ for every $y \neq 0$), or that the
eigenvalues of $H^TH$ are all positive. See Appendix B.5.

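Translated to the present case, the requirement is that for at least one
$n > 0$ the observability Gramian $\mathcal{G}$ defined by:

$$\mathcal{G} = H^T(i)H(i) + \sum_{j=1}^{n} \left(H(i+j)\prod_{k=0}^{j-1}F(i+k)\right)^T \left(H(i+j)\prod_{k=0}^{j-1}F(i+k)\right) \tag{8.16}$$

has rank equal to $M$. Here the product is ordered as in (8.15), i.e.
$F(i+j-1)\cdots F(i)$. Equivalently we check whether the Gramian is
positive definite. For time-invariant systems, $F$ and $H$ do not depend on
time, and the Gramian simplifies to:

$$\mathcal{G} = \sum_{j=0}^{n} (HF^j)^T (HF^j) \tag{8.17}$$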


If the system $F$ is stable (the magnitudes of all eigenvalues of $F$ are less
than one), we can set $n \to \infty$ to check whether the system is observable.

A second approach to determine the observability of a time-invariant,
deterministic system is to construct the observability matrix:

$$\mathcal{M} = \begin{bmatrix} H \\ HF \\ HF^2 \\ \vdots \\ HF^{M-1} \end{bmatrix} \tag{8.18}$$
According to (8.15), x(i) can be retrieved from a sequence
z(i), ..., z(i+M-1) if the observability matrix $\mathcal{M}$ is invertible;
that is, if the rank of $\mathcal{M}$ equals M.
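The rank test on the observability matrix can be sketched in a few lines of plain MATLAB, without the Control System Toolbox. The system used here (a constant-velocity model with position and velocity as states) is a hypothetical illustration that does not appear in the text, and the variable names are assumptions.

```matlab
% Hypothetical constant-velocity system (illustration only, not from the text)
F = [1 1; 0 1];            % state vector: [position; velocity]
M = size(F,1);             % dimension of the state vector

% Case 1: only the position is measured
H1 = [1 0];
O1 = zeros(0,M);
for k = 0:M-1
    O1 = [O1; H1*F^k];     % stack H, HF, ..., HF^(M-1)
end

% Case 2: only the velocity is measured
H2 = [0 1];
O2 = zeros(0,M);
for k = 0:M-1
    O2 = [O2; H2*F^k];
end

observable1 = (rank(O1) == M);   % position measurements also reveal the velocity
observable2 = (rank(O2) == M);   % velocity measurements never reveal the position
```

In this sketch the first configuration is observable and the second is not, because a sequence of velocity measurements carries no information about the initial position.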

The advantage of using the observability Gramian instead of the
observability matrix is that the former is more stable. Modelling errors
and round-off errors in the coefficients of F and H could make the
difference between an invertible $\mathcal{G}$ or $\mathcal{M}$ and a
noninvertible one. However, $\mathcal{G}$ is less prone to small errors than
$\mathcal{M}$ is.

A more quantitative measure of the observability is obtained by using
the eigenvalues of the Gramian $\mathcal{G}$. A suitable measure is the ratio
between the smallest eigenvalue and the largest eigenvalue. The system
is less observable as this ratio tends to zero. A similar result can be
obtained by using the singular values of the matrix $\mathcal{M}$ (see singular
value decomposition in Appendix B.6).


Example 8.6 Observability of a second order system
Consider the system (F, H) given by:

$$
F = \begin{bmatrix} 0.66 & 0.12 \\ 0.32 & 0.74 \end{bmatrix}
\qquad
H = \begin{bmatrix} \tfrac{1}{3} & \tfrac{1}{4} \end{bmatrix}
$$

The ranks of both the Gramian $\mathcal{G}$ and the observability matrix
$\mathcal{M}$ appear to be one, indicating that the system is not observable.
However, if the coefficients of F and H are represented in single
precision IEEE floating point format, the relative round-off error is
in the order of $10^{-8}$. These round-off errors cause the ratio of
eigenvalues of $\mathcal{G}$ to be in the order of $10^{-16}$ instead of zero.
The comparable ratio of singular values of $\mathcal{M}$ is in the order of
$10^{-8}$. Clearly, both ratios indicate that the observability is poor.
However, the one of $\mathcal{M}$ is overoptimistic. In fact, if MATLAB's
function rank() is applied to $\mathcal{G}$ and $\mathcal{M}$, the former returns
one, while the latter (erroneously) yields two. The corresponding MATLAB
code, given in Listing 8.1, uses functions from the Control System Toolbox.
The function ss() is of particular interest. It creates a state space
model, i.e. a special structure array containing all the matrices of
a linear time-invariant system.
Listing 8.1
Two methods of obtaining an observability measure of a linear time-
invariant system.

F = [0.66 0.12; 0.32 0.74];      % Define the system
H = [1/3 1/4];
Fs = double(single(F));          % Round-off to 32 bits
Hs = double(single(H));
B = [1; 0]; D = 0;
sys = ss(Fs,B,Hs,D,-1);          % Create state-space model
M = obsv(Fs,Hs);                 % Get observability matrix
G = gram(sys,'o');               % and Gramian
eigG = eig(G);                   % Calculate eigenvalues
svdM = svd(M);                   % and singular values
disp('ratio of eigenvalues of Gramian:');
min(eigG)/max(eigG)
disp('ratio of singular values of observability matrix:');
min(svdM)/max(svdM)


The concept of observability can also be extended such that the influence 
of measurement noise is incorporated. The stochastic observability has a 
strong connection with a particular implementation of the Kalman filter, 
known as the information filter. The details of this extension will follow 
in Section 8.3.3. 


8.2.2 Controllability 


In control theory, the concept of controllability usually refers to the
property that, for any state x(i) at a given time i, a finite input sequence
u(i), u(i+1), ..., u(i+n-1) exists that can drive the system to an
arbitrary final state x(i+n). If this is possible for any time, the system is
called completely controllable. (Some authors use the word 'reachability'
instead, and reserve the word 'controllability' for a system that can be
driven to zero, but not necessarily to an arbitrary state. See Åström and
Wittenmark (1990).) As with observability, the controllability
of a system can be revealed by checking the rank of a Gramian. For
time-invariant systems, the controllability can also be analysed by means
of the controllability matrix. This matrix arises from the following
equations:

$$
\begin{aligned}
x(i+1) &= Fx(i) + Lu(i) \\
x(i+2) &= F^2x(i) + FLu(i) + Lu(i+1) \\
x(i+3) &= F^3x(i) + F^2Lu(i) + FLu(i+1) + Lu(i+2) \\
&\;\;\vdots \\
x(i+n) &= F^nx(i) + \sum_{j=0}^{n-1} F^{n-1-j}Lu(i+j)
\end{aligned}
\tag{8.19}
$$

or:

$$
\begin{bmatrix} L & FL & \cdots & F^{n-1}L \end{bmatrix}
\begin{bmatrix} u(i+n-1) \\ \vdots \\ u(i+1) \\ u(i) \end{bmatrix}
= x(i+n) - F^nx(i)
\tag{8.20}
$$


The minimum number of steps, n, is at most equal to M, the dimension
of the state vector. Therefore, in order to test the controllability of the
system (F, L) it suffices to check whether the controllability matrix
$[L \;\; FL \;\; \cdots \;\; F^{M-1}L]$ has rank M.

The MATLAB functions for creating the controllability matrix and
Gramian are ctrb() and gram(), respectively.
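As a minimal sketch of the rank test, the controllability matrix can also be assembled by hand in plain MATLAB. The constant-velocity system and the two input matrices below are hypothetical illustrations; the pair (F9, G9) repeats the values of F and G quoted in Example 8.9, and all variable names are assumptions.

```matlab
% Hypothetical constant-velocity system (illustration only, not from the text)
F = [1 1; 0 1];
M = size(F,1);

La = [0; 1];               % input acts on the velocity
Lb = [1; 0];               % input acts on the position only

Ca = zeros(M,0);
Cb = zeros(M,0);
for k = 0:M-1
    Ca = [Ca, F^k*La];     % build [L FL ... F^(M-1)L] column by column
    Cb = [Cb, F^k*Lb];
end

controllable_a = (rank(Ca) == M);
controllable_b = (rank(Cb) == M);

% Values quoted in Example 8.9: G9 is an eigenvector of F9, so F9*G9 is
% parallel to G9 and the second singular value is at round-off level
F9 = [0.66 0.32; 0.12 0.74];
G9 = [1/3; 1/4];
s9 = svd([G9, F9*G9]);
ratio9 = s9(2)/s9(1);
```

With the input on the velocity the rank is two; with the input on the position only, the velocity can never be changed and the rank drops to one.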


8.2.3 Dynamic stability and steady state solutions 


The term stability refers to the ability of a system to resist and recover
from disturbances acting on it. A state estimator has to face
three different causes of instability: sensor instability, numerical
instability and dynamic instability.

Apart from the usual sensor noise and sensor linearity errors, a sensor 
may produce unusual glitches and other errors caused by hard to predict 
phenomena, such as radio interference, magnetic interference, thermal 
drift, mechanical shocks and so on. This kind of behaviour is sometimes 
denoted by sensor instability. Its early detection can be done using 
consistency checks (to be discussed in Section 8.4). 

Numerical instabilities originate from round-off errors. In particular,
the inversion of a nearly singular matrix may cause large errors due to its
sensitivity to round-off errors. A careful implementation of the design
must prevent these errors. See Section 8.3.

The third cause of instability lies in the dynamics of the state
estimator itself. In order to study the dynamic stability of the state
estimator it is necessary to consider the estimator as a dynamic system
with the measurements z(i) and the control vectors u(i) as inputs. See
Appendix D. The output consists of the estimates $\hat{x}(i|i)$. In linear
systems, the stability does not depend on the input sequences. For the
stability analysis it suffices to assume zero z(i) and u(i). The equations
of interest are derived from (8.2):

$$
\hat{x}(i+1|i+1) = \left(I - K(i)H(i)\right)F(i)\,\hat{x}(i|i)
\tag{8.21}
$$

with:

$$
K(i) = P(i)H^T(i)\left(H(i)P(i)H^T(i) + C_v(i)\right)^{-1}
$$

P(i) is the covariance matrix C(i+1|i) of the predicted state $\hat{x}(i+1|i)$.
P(i) is recursively defined by the discrete Riccati equation:

$$
P(i+1) = F(i)P(i)F^T(i) + C_w(i)
- F(i)P(i)H^T(i)\left(H(i)P(i)H^T(i) + C_v(i)\right)^{-1}H(i)P(i)F^T(i)
\tag{8.22}
$$


The recursion starts with $P(0) = C_x(0)$. The first term in (8.22)
represents the absorption of uncertainty due to the dynamics of the system
during each time step (provided that F is stable; otherwise it represents
the growth of uncertainty). The second term represents the additional
uncertainty at each time step due to the process noise. The last term
represents the reduction of uncertainty thanks to the measurements.

For the stability analysis of a Kalman filter it is of interest to know
whether a sequence of process noise, w(i), can influence each element of
the state vector independently. The answer to this question is found by
writing the covariance matrix $C_w(i)$ of the process noise as
$C_w(i) = G(i)G^T(i)$. Here, G(i) can be obtained by an eigenvalue/
eigenvector diagonalization of $C_w(i)$. That is,
$C_w(i) = V_w(i)\Lambda_w(i)V_w^T(i)$ and $G(i) = V_w(i)\Lambda_w^{1/2}(i)$.
See Appendix B.5 and C.3. $\Lambda_w(i)$ is a K x K diagonal matrix where K is
the number of nonzero eigenvalues of $C_w(i)$. The matrices G(i) and
$V_w(i)$ are M x K matrices. The introduction of G(i) allows us to write the
system equation as:

$$
x(i+1) = F(i)x(i) + L(i)u(i) + G(i)\tilde{w}(i)
\tag{8.23}
$$

where $\tilde{w}(i)$ is a K-dimensional Gaussian white noise vector with
covariance matrix I. The noise sequence $\tilde{w}(i)$ influences each element
of x(i) independently if the system (F(i), G(i)) is controllable by the
sequence $\tilde{w}(i)$. We then have the following theorem (Bar-Shalom and
Li, 1993), which provides a sufficient (but not a necessary) condition for
stability:


If a time invariant system (F, H) is completely observable and the
system (F, G) is completely controllable, then for any initial condition
$P(0) = C_x(0)$ the solution of the Riccati equation converges to a
unique, finite, invertible steady state covariance matrix $P(\infty)$.


The observability of the system assures that the sequence of measurements
contains information about the complete state vector. Therefore,
the observability guarantees that the prediction covariance matrix
$P(\infty)$ is bounded from above. Consequently, the steady state Kalman
filter is BIBO stable. See Appendix D.3.2. If one or more state elements are
not observable, the Kalman gains for these elements will be zero, and the
estimation of these elements is purely prediction driven (model based
without using measurements). If the system F is unstable for these elements,
then so is the Kalman filter.

The controllability with respect to the process noise assures that
$P(\infty)$ is unique (it does not depend on $C_x(0)$), and that its inverse
exists. If the system is not controllable, then $P(\infty)$ might depend on
the specific choice of $C_x(0)$. If for the non-controllable state elements
the system F is stable, then the corresponding eigenvalues of $P(\infty)$ are
zero. Consequently, the Kalman gains for these elements are zero, and the
estimation of these elements is again purely prediction driven.

The steady state solution of the Riccati equation follows from setting
P(i+1) = P(i) = P in (8.22). This yields the discrete algebraic Riccati
equation, mentioned in Chapter 4:

$$
P = FPF^T + C_w - FPH^T\left(HPH^T + C_v\right)^{-1}HPF^T
\tag{8.24}
$$

The steady state is reached due to the balance between the growth of
uncertainty (the second term and possibly the first term) and the reduction
of uncertainty (the third term and possibly the first term).

A steady state solution of the Riccati equation gives rise to a constant
Kalman gain. The corresponding Kalman filter, derived from (8.2) and (8.21):

$$
\hat{x}(i+1|i+1) = (I - KH)F\hat{x}(i|i) + (I - KH)Lu(i) + Kz(i+1)
\tag{8.25}
$$

with:

$$
K = PH^T\left(HPH^T + C_v\right)^{-1}
$$

becomes time invariant.


Example 8.7 Stability of a system that is not observable
Consider the system (F, H, $C_w$, $C_v$) given by:

$$
F = \begin{bmatrix} 0.66 & 0.32 \\ 0.12 & 0.74 \end{bmatrix}
\qquad
H = \begin{bmatrix} 1.2 & 2.4 \end{bmatrix}
\qquad
C_w = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
\qquad
C_v = 1
$$

This system is controllable, but not observable. Yet, the Riccati equation
is asymptotically stable with steady state solution:

$$
P = \begin{bmatrix} 1.3110 & -0.0822 \\ -0.0822 & 1.1293 \end{bmatrix}
\quad \text{having eigenvalues 1.0976 and 1.3427}
$$

The eigenvalues of the corresponding steady state Kalman filter, i.e.
the eigenvalues of (I - KH)F, are 0.5 and 0.101. Thus the Kalman
filter is stable.

Listing 8.2 provides the MATLAB code for this example. The function
dlqe() returns the Kalman gain matrix together with the prediction
covariance, the error covariance and the eigenvalues of the
Kalman filter. Alternatively, we use the function kalman() that creates
the estimator as a state space model. Internally, it uses the
function dare() which solves the discrete algebraic Riccati equation.
The functions are from the Control System Toolbox.
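The steady state of Example 8.7 can also be approximated without the Control System Toolbox by simply iterating the recursion (8.22) until it stops changing. The following is a minimal sketch under that assumption: the system matrices are those quoted in the example, whereas the starting value, the iteration limit and the stopping tolerance are arbitrary choices made here for illustration.

```matlab
% System of Example 8.7 (values quoted in the text)
F  = [0.66 0.32; 0.12 0.74];
H  = [1.2 2.4];
Cw = eye(2);
Cv = 1;

P = Cw;                          % assumed starting value P(0), not from the text
for i = 1:500                    % assumed iteration limit
    S    = H*P*H' + Cv;          % innovation matrix
    Pnew = F*P*F' + Cw - F*P*H'*(S\(H*P*F'));   % one step of (8.22)
    converged = norm(Pnew - P, 'fro') < 1e-12;
    P = Pnew;
    if converged, break; end
end

K = P*H'/(H*P*H' + Cv);                  % steady state Kalman gain
lambdaKF = sort(eig((eye(2) - K*H)*F));  % eigenvalues of the Kalman filter
eigP = sort(eig(P));                     % eigenvalues of the prediction covariance
```

Plain iteration converges here because the unobserved state is stable; for routine use, a dedicated solver such as dare() is the more robust choice.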


The explanation for the stability in the last example is as follows.
Suppose that the measurements are switched off, that is, K = 0. In that
case, the estimates just follow the dynamics of the system:
$\hat{x}(i+1|i+1) = F\hat{x}(i|i) + Lu(i)$. Since F is stable, the initial
uncertainty $C_x(0)$ will be absorbed. In the steady state, the final
uncertainty is given by the balance:

$$
C_x(\infty) = FC_x(\infty)F^T + C_w
\tag{8.26}
$$

(the discrete Lyapunov equation). Since the Kalman filter is optimal, the
steady state solution of the Riccati equation is bounded from above by
$C_x(\infty)$, and thus asymptotically stable. The MATLAB function dlyap()
returns the solution of the discrete Lyapunov equation.


Example 8.8 Stability of a system that is not observable (continued)
A further inspection of the situation confirms the statement made
above. The solution of the discrete Lyapunov equation is:

$$
C_x(\infty) = \begin{bmatrix} 4.0893 & 2.3094 \\ 2.3094 & 3.2472 \end{bmatrix}
\quad \text{with eigenvalues 1.32 and 6.02}
$$

Indeed, $P < C_x(\infty)$ (meaning that the difference $C_x(\infty) - P$ is
positive semidefinite; i.e. possesses only non-negative eigenvalues). The
eigenvalues of F are 0.5 and 0.9. The eigenvalue of 0.5 corresponds to the
state in the diagonalized system (Appendix D.3.1) that is not observed
by the measurements. However, this state is stable. The Kalman gain
for this state is zero, and thus the steady state Kalman filter copies this
eigenvalue.

If the second eigenvalue of F is increased from 0.9 to, say, 1.5, the
system is not stable anymore. Nevertheless, the steady state solution of
the Riccati equation exists, and the corresponding Kalman filter is stable.
However, if the first eigenvalue of F is increased from 0.5 to 1.5, the
system is again not stable. But this time, the corresponding Kalman
filter is not stable either. The Riccati equation is not stable anymore.


Example 8.9 Stability of a system that is not controllable
Consider the system (F, H, $C_w$, $C_v$) given by:

$$
F = \begin{bmatrix} 0.66 & 0.32 \\ 0.12 & 0.74 \end{bmatrix}
\qquad
C_w = \begin{bmatrix} 0.1111 & 0.0833 \\ 0.0833 & 0.0625 \end{bmatrix}
\qquad
H = \begin{bmatrix} 1 & 1 \end{bmatrix}
\qquad
C_v = [1]
$$

This system is observable. The covariance matrix of the process noise
can be written as $C_w = GG^T$ with $G^T = [0.333 \;\; 0.25]$. The system
(F, G) is not controllable, as a simple test can show. The steady state
solution of the Riccati equation is:

$$
P = \begin{bmatrix} 0.2166 & 0.1624 \\ 0.1624 & 0.1218 \end{bmatrix}
\quad \text{with eigenvalues 0 and 0.338}
$$

P is not invertible. The eigenvalues of F are 0.9 and 0.5. The Kalman
filter, with eigenvalues 0.5411 and 0.5, is stable.

The explanation of this behaviour is as follows. The diagonalized
system has one controllable state (corresponding to an eigenvalue of
0.9). For this state, the Kalman filter behaves regularly. The second
state (with eigenvalue 0.5) is not controllable. This state is not
affected by the process noise. It is a stable state, and thus, the initial
uncertainty fades out. The zero variance of this state causes a zero
eigenvalue in $C_x(\infty)$. With that, the Kalman gain for that state also
becomes zero because without uncertainty there is no need for
measurements. Consequently, the eigenvalue of the system repeats
itself in the Kalman filter. The zero eigenvalue of $C_x(\infty)$ causes a
corresponding zero eigenvalue in P. Thus, this matrix is not invertible.

If the second eigenvalue of F is increased from 0.5 to 1.5, the initial
condition $C_x(0)$ influences the long term behaviour of $C_x(i)$. If
$C_x(0) = 0$, then $C_x(i)$ converges to a constant. But this solution is
not stable. A small perturbation of $C_x(0)$ causes $C_x(i)$ to diverge to
infinity. Small perturbations of $C_x(0)$ trigger P(i) to follow quite
different trajectories, but they finally converge to a nonzero steady
state for which the Kalman filter is stable.


The last example shows that if a system (F,G) is not controllable, 
some eigenvalues of the prediction covariance matrix may become zero. 
The matrix P is positive semidefinite. Such a situation does not contrib- 
ute to the numerical stability. 


Listing 8.2
Steady state solution of a system that is not observable.

lambda = diag([0.9 0.5]);        % Define a system with
V = [1/3 -1; 1/4 1/2];           % eigenvalues 0.9 and 0.5
F = V*lambda*inv(V);
H = inv(V);                      % Define a measurement matrix
H(2,:) = [];                     % that only observes one state
Cv = eye(1); Cw = eye(2);        % Define covariance matrices
% Discrete steady state Kalman filter:
[M,P,Z,E] = dlqe(F,eye(2),H,Cw,Cv);
Cx_inf = dlyap(F,Cw)             % Solution of discrete Lyapunov equation
disp('Kalman gain matrix'); disp(M);
disp('Eigenval. of Kalman filter'); disp(E);
disp('Error covariance'); disp(Z);
disp('Prediction covariance'); disp(P);
disp('Eigenval. of prediction covariance'); disp(eig(P));
disp('Solution of discrete Lyapunov equation'); disp(Cx_inf);
disp('Eigenval. of sol. of discrete Lyapunov eq.'); disp(eig(Cx_inf));



8.3 COMPUTATIONAL ISSUES 


A straightforward implementation of the time-variant Kalman filter may
result in estimation errors that are too large. The magnitudes of these
errors are not compatible with the error covariance matrices. The filter may
even diverge completely although theoretically it should be stable.
This anomalous behaviour is due to a number representation with
limited precision. In order to find where round-off errors have the
largest impact it is instructive to rephrase the Kalman equations in
(8.2) as follows:

Riccati loop:

$$
\begin{aligned}
C(i|i) &= C(i|i-1) - C(i|i-1)H^T\left(HC(i|i-1)H^T + C_v\right)^{-1}HC(i|i-1) \\
C(i+1|i) &= FC(i|i)F^T + C_w
\end{aligned}
$$

Kalman gain, passed from the Riccati loop to the estimation loop:

$$
K(i) = C(i|i-1)H^TS^{-1}(i)
\quad \text{with} \quad
S(i) = HC(i|i-1)H^T + C_v
\tag{8.27}
$$

Estimation loop:

$$
\begin{aligned}
\hat{x}(i|i) &= \hat{x}(i|i-1) + K(i)\left(z(i) - H\hat{x}(i|i-1)\right) \\
\hat{x}(i+1|i) &= F\hat{x}(i|i) + Lu(i)
\end{aligned}
$$


For simplicity, the system (F, L, H) is written without the time index.

As can be seen, the filter consists of two loops. If the Kalman filter is
stable, the second loop (the estimation loop) is usually not that sensitive
to round-off errors. Possible induced errors are filtered in much the same
way as the measurement noise. However, the loop depends on the
Kalman gains K(i). Large errors in K(i) may cause the filter to become
unstable. These gains come from the first loop.

In the first loop (the Riccati loop) the prediction covariance matrix
P(i) = C(i+1|i) is recursively calculated. As can be seen, the recursion
involves nonlinear matrix operations including a matrix inversion. In
particular, the representation of these matrices (and through this the
Kalman gain) may be sensitive to the effects of round-off errors.

The sensitivity to round-off errors becomes apparent if an eigenvalue-
eigenvector decomposition (Appendix B.5 and C.3.2) is applied to the
covariance matrix P:

$$
P = \sum_{m=1}^{M} \lambda_m v_m v_m^T
\tag{8.28}
$$

$\lambda_m$ are the eigenvalues of P and $v_m$ the corresponding eigenvectors.
The eigenvalues of a properly behaving covariance matrix are all positive
(the matrix is positive definite and non-singular). However, the range of the
eigenvalues may be very large. This finds expression in the condition
number $\lambda_{max}/\lambda_{min}$ of the matrix. Here,
$\lambda_{max} = \max(\lambda_m)$ and $\lambda_{min} = \min(\lambda_m)$. A large
condition number indicates that if the matrix is inverted, the propagation
of round-off errors will be large:

$$
P^{-1} = \sum_{m=1}^{M} \frac{1}{\lambda_m} v_m v_m^T
\tag{8.29}
$$

In a floating point representation of P, the exponents are largely
determined by $\lambda_{max}$. Therefore, the round-off error in
$\lambda_{min}$ is proportional to $\lambda_{max}$, and may be severe. It will
result in large errors in $1/\lambda_{min}$.

Another operation with a large sensitivity to round-off errors is the 
subtraction of two similar matrices. 

These errors can result in a loss of symmetry in the covariance 
matrices and a loss of positive definiteness. In some cases, the eigenval- 
ues of the covariance matrices can even become negative. If this occurs, 
the errors may accumulate during each recursion, and the process may 
diverge. 
As an example, consider the following Riccati loop:

$$
\begin{aligned}
S(i) &= HC(i|i-1)H^T + C_v \\
K(i) &= \left(HC(i|i-1)\right)^T S^{-1}(i) \\
C(i|i) &= C(i|i-1) - K(i)HC(i|i-1) \\
C(i+1|i) &= FC(i|i)F^T + C_w
\end{aligned}
\tag{8.30}
$$

This loop is mathematically equivalent to (8.2), but is computationally
less expensive because the factor HC(i|i-1) appears at several places
and can be reused. However, the form is prone to round-off errors:

• The part K(i)H can easily introduce asymmetries in C(i|i).
• The subtraction I - K(i)H can introduce negative eigenvalues.

The implementation should be used cautiously. The preferred
implementation is (8.2):

$$
C(i|i) = C(i|i-1) - K(i)S(i)K^T(i)
\tag{8.31}
$$

It requires more operations than expression (8.30), but it is more
balanced. Therefore, the risk of introducing asymmetries is much lower.
Still, this implementation does not guarantee non-negative eigenvalues.
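The two covariance updates can be compared numerically. The following is a minimal sketch in which the matrices are made-up values chosen for illustration; in double precision and for such a well-conditioned case both forms agree to round-off level, so the sketch only demonstrates their algebraic equivalence, not the failure mode discussed above.

```matlab
% Hypothetical two-state, one-measurement case (illustration only)
C  = [4 1; 1 2];              % predicted covariance C(i|i-1)
H  = [1 0.5];
Cv = 0.25;

S  = H*C*H' + Cv;             % innovation matrix
K  = (H*C)'/S;                % Kalman gain, K = (HC)^T S^-1

C_efficient = C - K*(H*C);    % update with the reused factor HC
C_balanced  = C - K*S*K';     % symmetric ("balanced") update

difference          = norm(C_efficient - C_balanced, 'fro');
asymmetry_efficient = norm(C_efficient - C_efficient', 'fro');
asymmetry_balanced  = norm(C_balanced  - C_balanced',  'fro');
min_eig_balanced    = min(eig((C_balanced + C_balanced')/2));
```

Running the same comparison after casting all variables with single() is one way to make the differences between the two forms visible.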

Listing 8.3 implements the Kalman filter using (8.31). The functions
acquire_measurement_vector() and get_control_vector()
are placeholders for the interfacing to the measurement system and a
possible control system. These functions should also take care of the
timing of the estimation process. The required number of operations
(using MATLAB's standard matrix arithmetic) of the update step is about
$2M^2N + 3MN^2 + N^3$. The number of operations of the prediction
step is about $2M^3$.





Listing 8.3
Conventional time variant Kalman filter for a time invariant system.

load linsys;                     % Load a system: F,H,Cw,Cv,Cx0,x0,L
C = Cx0;                         % Initialize prior uncertainty
xprediction = x0;                % and prior mean
while (1)                        % Endless loop
    % Update:
    S = H*C*H' + Cv;             % Innovation matrix
    K = C*H'*S^-1;               % Kalman gain matrix
    z = acquire_measurement_vector();
    innovation = z - H*xprediction;
    xestimation = xprediction + K*innovation;
    C = C - K*S*K';
    % Prediction:
    u = get_control_vector();
    xprediction = F*xestimation + L*u;
    C = F*C*F' + Cw;
end


Example 8.10 Numerical instability of the conventional Kalman filter
Consider the system (F, H, $C_w$, $C_v$, $C_x(0)$) given by:

$$
F = \begin{bmatrix} 0.9698 & 0.1434 & -0.1101 \\ -0.1245 & 0.9547 & 0.2557 \\ 0.1312 & -0.2455 & 0.9557 \end{bmatrix}
\qquad
C_x(0) = \begin{bmatrix} 100 & 0 & 0 \\ 0 & 100 & 0 \\ 0 & 0 & 100 \end{bmatrix}
$$

$$
C_w = AA^T \quad \text{with} \quad A = \begin{bmatrix} 8.1174 \\ 3.9092 \\ 4.3388 \end{bmatrix}
$$

$$
H = \begin{bmatrix} 0.0812 & 0.0391 & 0.0434 \\ -0.4339 & 0.9010 & 0 \\ -0.3909 & -0.1883 & 0.9010 \end{bmatrix}
\qquad
C_v = \begin{bmatrix} 10^{6} & 0 & 0 \\ 0 & 10^{?} & 0 \\ 0 & 0 & 10^{?} \end{bmatrix}
$$

The eigenvalues of F are $0.999\exp(-0.1\pi j)$, $0.999\exp(0.1\pi j)$ and 0.98,
where $j = \sqrt{-1}$. The magnitudes of the first two eigenvalues are close to
one, and therefore, the system is only just stable. A diagonalization of the
system would reveal that the process noise only affects the state that
corresponds with the eigenvalue 0.98. Hence, the system is not controllable
by the process noise. The measurement matrix is such that it measures the
diagonalized states.

An implementation of (8.30) for this system yields results as shown in
Figure 8.6. The results are obtained by means of a simulation of the
system and a 32-bit IEEE floating point implementation of the Kalman
filter (including the Riccati loop). The relative round-off errors are of the
order $10^{-8}$. The Kalman filter appears to be unstable due to the negative
eigenvalues of the covariance matrix. The same implementation using
64-bit double precision (relative error is around $10^{-16}$) yields stable
results. Application of the code in Listing 8.3, but performed with 32-bit
precision, gives the results shown in Figure 8.7. Although the filter is
stable now, it still does not work properly.
Figure 8.6 Results of a computationally efficient implementation of the
conventional Kalman filter (panels: minimum eigenvalue of C(i|i);
measurements). The filter is unstable due to an eigenvalue of P that remains
negative


The symmetry of a matrix P can be enforced by the assignment:
$P := (P + P^T)/2$. The addition of this type of statement at some critical
locations in the code helps a little, but often not enough. The true
remedy is to use a proper implementation combined with sufficient
number precision. The remaining part of this section discusses a number
of different implementations.


8.3.1 The linear-Gaussian MMSE form 


In Section 3.1.3 we derived the MMSE estimator for static variables in
the linear-Gaussian case. In Section 3.1.5, the unbiased linear MMSE
estimator was derived. Since the MMSE solution, expressed in equation
(3.20), appeared to be linear and unbiased, the conclusion was drawn
that this solution is identical to the unbiased linear MMSE solution
(given in (3.33) and (3.45)). However, the two solutions have different
forms. We denote the first solution by the (linear-Gaussian) MMSE
form. The second solution is the Kalman form. The equivalence between
the two solutions can be shown by using the matrix inversion lemma,
(B.10).

Figure 8.7 Results of a balanced implementation of the conventional Kalman
filter (panels: minimum eigenvalue of C(i|i); measurements). The filter is
stable, but still an eigenvalue of C(i|i) is sometimes negative

As a result of all this, the Kalman filter can also be expressed in two
forms. The following implementation is based on the MMSE form, i.e.
equation (3.20):

update:

$$
\begin{aligned}
C(i|i) &= \left(C^{-1}(i|i-1) + H^TC_v^{-1}H\right)^{-1}
&& \text{(error covariance matrix)} \\
\hat{x}(i|i) &= C(i|i)\left(C^{-1}(i|i-1)\hat{x}(i|i-1) + H^TC_v^{-1}z(i)\right)
&& \text{(updated estimate)}
\end{aligned}
\tag{8.32}
$$


The Kalman form, given in (8.2), requires the inversion of S(i), an N x N
matrix. The MMSE form requires the inversion of C(i|i-1) and
$C^{-1}(i|i-1) + H^TC_v^{-1}H$; both are M x M matrices. It also requires the
inversion of $C_v$, but this can be done outside the Riccati loop. Besides,
$C_v$ is often a diagonal matrix (uncorrelated measurement noise), whose
inversion poses no problems. The situation where $C_v$ is not invertible
is a degenerate case: one or more measurements are completely correlated
with other measurements. Such a measurement can be removed
without loss of information.

In the time-invariant case, the calculation of the term $H^TC_v^{-1}H$,
appearing in (8.32), can be kept outside the loop. If so, the number of
operations is $2M^3 + M^2 + M$. Thus, if there are many measurements and
only a few states, i.e. N >> M, the MMSE form might be favourable. But
in other cases, the Kalman form is often preferred.
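The equivalence of the two forms can be checked numerically. The sketch below uses a made-up system with M = 2 states and N = 3 measurements; all numerical values and variable names are assumptions introduced for illustration.

```matlab
% Hypothetical system with M = 2 states and N = 3 measurements (illustration only)
C  = [4 1; 1 2];                 % predicted covariance C(i|i-1)
xp = [1; -1];                    % predicted state
H  = [1 0; 0 1; 1 1];
Cv = diag([0.5 0.2 1.0]);
z  = [1.3; -0.7; 0.4];           % made-up measurement vector

% Kalman form: inversion of the N x N innovation matrix S
S        = H*C*H' + Cv;
K        = C*H'/S;
x_kalman = xp + K*(z - H*xp);
C_kalman = C - K*S*K';

% MMSE form: inversions of M x M matrices only
HtCvinvH = H'*(Cv\H);            % constant in the time-invariant case
C_mmse   = inv(inv(C) + HtCvinvH);
x_mmse   = C_mmse*(C\xp + H'*(Cv\z));

diff_x = norm(x_kalman - x_mmse);
diff_C = norm(C_kalman - C_mmse, 'fro');
```

Both differences are at round-off level here, which is what the matrix inversion lemma predicts; the forms differ only in cost and in numerical behaviour.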


Example 8.11 The linear-Gaussian MMSE form 

Application of the MMSE form to the same system and data as in
Example 8.10 yields the results as shown in Figure 8.8 (using 32-bit
floating point number representations). At first sight, there seems to be
nothing wrong with these results. However, the sudden change of
the smallest eigenvalue at i = 2, from which point on it remains
constant, is reason to become suspicious.


8.3.2 Sequential processing of the measurements 


Figure 8.8 Results of the Kalman filter implemented in the MMSE form

The conventional Kalman filter processes the measurement data in blocks.
The data $z_n(i)$ of the sensors available at time i are collected in the
measurement vector z(i), and processed as one unit. Another possibility
is to process the individual measurements sequentially. A requirement is
that the measurement noise is uncorrelated: $C_v$ must be a diagonal matrix.
If not, the measurement vector must be decorrelated first using techniques
as described in Appendix C.3.1 (Kaminski et al., 1971). In the following
algorithm, the diagonal elements of $C_v$ are denoted by $\sigma_n^2$. The row
vector $h_n$ stands for the n-th row of H. Thus,
$H^T = [h_0^T \;\; \cdots \;\; h_{N-1}^T]$.

Algorithm 8.1: Sequential update
1. Initialization:

• $\hat{x}(i,0) := \hat{x}(i|i-1)$
• $C(i,0) := C(i|i-1)$

2. Sequential update:
For n = 0, 1, 2, ..., N-1:

• $s(i,n) = h_nC(i,n)h_n^T + \sigma_n^2$   (innovation variance)
• $k_n = C(i,n)h_n^T / s(i,n)$   (Kalman gain vector)
• $\hat{x}(i,n+1) = \hat{x}(i,n) + k_n\left(z_n(i) - h_n\hat{x}(i,n)\right)$   (update of the estimate)
• $C(i,n+1) = C(i,n) - C(i,n)h_n^Th_nC(i,n) / s(i,n)$   (update of the covariance)

3. Closure:

• $\hat{x}(i|i) = \hat{x}(i,N)$
• $C(i|i) = C(i,N)$


The number of required operations is about $2M^2N + 2MN$. It outperforms
the conventional Kalman filter. However, the subtraction that is
needed to get C(i,n+1) can introduce negative eigenvalues. Consequently,
the algorithm is still sensitive to round-off errors.
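The steps of the sequential update can be sketched as follows and compared with the block update. The numerical values are made up for illustration, and the loop index runs from 1 to N because MATLAB indices start at one (the algorithm in the text counts from 0 to N-1).

```matlab
% Hypothetical system (illustration only); Cv must be diagonal
C0     = [4 1; 1 2];             % predicted covariance C(i|i-1)
x0     = [1; -1];                % predicted state
H      = [1 0; 0 1; 1 1];
sigma2 = [0.5 0.2 1.0];          % diagonal elements of Cv
z      = [1.3; -0.7; 0.4];       % made-up measurement vector

x = x0;  C = C0;                 % 1. initialization
for n = 1:size(H,1)              % 2. sequential update, one measurement at a time
    h = H(n,:);                  % n-th row of H
    s = h*C*h' + sigma2(n);      % innovation variance (a scalar)
    k = C*h'/s;                  % Kalman gain vector
    x = x + k*(z(n) - h*x);      % update of the estimate
    C = C - (C*h')*(h*C)/s;      % update of the covariance
end                              % 3. closure: x and C hold the updated results

% Block update for comparison
S       = H*C0*H' + diag(sigma2);
K       = C0*H'/S;
x_block = x0 + K*(z - H*x0);
C_block = C0 - K*S*K';

diff_x = norm(x - x_block);
diff_C = norm(C - C_block, 'fro');
```

Only scalar divisions appear inside the loop, which is where the saving in operations comes from.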


Example 8.12 Sequential processing 

The results obtained by the application of sequential processing of the
measurements to the problem from Example 8.10 are shown in
Figure 8.9. Negative eigenvalues are not prevented, and the filter does
not behave correctly.


8.3.3 The information filter 


In the conventional Kalman filter it is difficult to represent a situation in
which no knowledge is available about (a subspace of) the state. It would
require that some eigenvalues of the corresponding error covariance
matrix would be infinity. The concept of an information matrix circumvents
this problem.

Figure 8.9 Sequential processing of the measurements

An information matrix is the inverse of a covariance matrix. If one or 
more eigenvalues of an information matrix are small, then a subspace 
exists in which the uncertainty of the random vector is large. In fact, if 
one or more eigenvalues are zero, then no knowledge exists about the 
random vector in the subspace spanned by the corresponding eigenvec- 
tors. In this situation, the covariance matrix does not exist because the 
information matrix is not invertible. 

The information filter is an implementation of the Kalman filter in
which the Riccati loop is entirely expressed in terms of information
matrices (Grewal and Andrews, 2001). Let Y(i|j) be the information
matrix corresponding to C(i|j). Thus:

$$
Y(i|j) \overset{\text{def}}{=} C^{-1}(i|j)
\tag{8.33}
$$

Using (8.32), the update in the Kalman filter is rewritten as:

update:

$$
\begin{aligned}
Y(i|i) &= Y(i|i-1) + H^TC_v^{-1}H
&& \text{(information matrix)} \\
\hat{x}(i|i) &= Y^{-1}(i|i)\left(Y(i|i-1)\hat{x}(i|i-1) + H^TC_v^{-1}z(i)\right)
&& \text{(updated estimate)}
\end{aligned}
\tag{8.34}
$$

The computational cost of the update of the information matrix is only
determined by the inversion of Y(i|i) (in the time-invariant case) because
the term $H^TC_v^{-1}H$ is constant and can be kept outside the loop. The
number of required operations is about $M^3 + 4M^2 + 5M$.

In order to develop the expression of the predicted state information
matrix, the covariance of the process noise is factored as follows:
$C_w = GG^T$. As mentioned in Section 8.2.3, such a factorization is
obtained by an eigenvector-eigenvalue decomposition
$C_w = V_w\Lambda_wV_w^T$. The diagonal elements of $\Lambda_w$ contain the
eigenvalues of $C_w$. If some of the eigenvalues are zero, we remove the rows
and columns in which these zero eigenvalues appear. Also, the corresponding
columns in $V_w$ are removed. Suppose that the number of nonzero eigenvalues
is K, then $\Lambda_w$ becomes a K x K matrix and $V_w$ an M x K matrix.
Consequently, $G = V_w\Lambda_w^{1/2}$ is also an M x K matrix.

Furthermore, we define the matrix A(i) as:

$$
A(i) \overset{\text{def}}{=} (F^{-1})^T Y(i|i) F^{-1}
\tag{8.35}
$$

In other words, $A^{-1}(i) = FC(i|i)F^T$ is the predicted covariance matrix in
the absence of process noise. Here, we have silently assumed that the
matrix F is invertible (which is the case if the time-discrete system is an
approximation of a time-continuous system).

The information matrix of the predicted state follows from (8.2):

$$
\begin{aligned}
Y(i+1|i) &= \left(FC(i|i)F^T + GG^T\right)^{-1} \\
&= \left(A^{-1}(i) + GG^T\right)^{-1}
\end{aligned}
\qquad \text{(predicted state information)}
\tag{8.36}
$$

Using the matrix inversion lemma, this expression can be moulded in the
easier to implement form:

$$
Y(i+1|i) = A(i) - A(i)G\left(G^TA(i)G + I\right)^{-1}G^TA(i)
\tag{8.37}
$$

This completes the Riccati loop. The number of required operations of
(8.36) and (8.37) is $2M^3 + 2M^2K + 2MK^2 + K^3$.
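A short numerical check can confirm that the form obtained with the matrix inversion lemma agrees with the direct definition of the predicted information matrix. The matrices below are made-up values, chosen only so that F and Y(i|i) are invertible.

```matlab
% Hypothetical system (illustration only)
F = [0.9 0.1; 0 0.8];            % invertible system matrix
Y = [2 0.5; 0.5 1];              % information matrix Y(i|i)
G = [1; 0.5];                    % factor of the process noise covariance, Cw = G*G'
K = size(G,2);                   % number of nonzero eigenvalues of Cw

Finv = inv(F);
A    = Finv'*Y*Finv;             % predicted information without process noise

% Direct form: invert the predicted covariance F*C*F' + G*G' with C = inv(Y)
Y_pred_direct = inv(F*inv(Y)*F' + G*G');

% Form from the matrix inversion lemma: only a K x K system is solved
Y_pred_lemma  = A - A*G*((G'*A*G + eye(K))\(G'*A));

diff_Y = norm(Y_pred_direct - Y_pred_lemma, 'fro');
```

The second form never inverts Y(i|i), so it remains usable when the information matrix is singular, which is the situation the information filter is meant to handle.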

As mentioned above, the information filter can represent the situ- 
ation where no information about some states is available. Typically, 
this occurs at the initialization of the filter, if no prior knowledge is 
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available. However, the information matrix cannot represent a situa- 
tion where states are known precisely, i.e. without uncertainty. 
Typically, such a situation occurs when the system (F,G) is not 
controllable. 

The information filter also offers the possibility to define a stochastic 
observability criterion. For that purpose, consider the system without 
process noise. In that case, Y(i+ 1|i) = (Ft TY (ili) Ft. Starting with 
complete uncertainty, i.e. Y(0| — 1) = 0, the prediction information at 
time 7 is found by iterative application of (8.34): 


Y(i|i— 1) = T(E) aC HEY (8.38) 


j=0 


Note the similarity of this expression with the observability Grammian 
given in (8.17). The only difference is the information matrix Cy! which 
weighs the importance of the measurements. Clearly, if for some i all 
eigenvalues of Y(i|i— 1) are positive, then the measurements have pro- 
vided information to all states. 
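
The criterion can be tried out with a few lines of MATLAB. The sketch below assumes that the information update is $Y(i|i) = Y(i|i-1) + H^TC_v^{-1}H$ (our reading of (8.34), which is not repeated here) and uses a made-up two-state system; all names and values are illustrative only.

```matlab
% Sketch: stochastic observability via the information matrix.
% F, H and Cv are hypothetical illustration values.
F  = [1 1; 0 1];              % invertible transition matrix
H  = [1 0];                   % only the first state is measured
Cv = 0.5;                     % measurement noise variance
M  = size(F, 1);
Finv = inv(F);
nsteps = 3;

Y = zeros(M);                 % complete uncertainty: Y(0|-1) = 0
min_eig = zeros(1, nsteps);
for step = 1:nsteps
    Y = Y + H' * (Cv \ H);    % assumed information update
    Y = Finv' * Y * Finv;     % prediction without process noise
    min_eig(step) = min(eig((Y + Y')/2));
end

% The same matrix from the closed-form sum (8.38) with i = nsteps
Ysum = zeros(M);
for j = 0:nsteps-1
    P = Finv^(nsteps - j);    % equals F^(j-i)
    Ysum = Ysum + P' * H' * (Cv \ H) * P;
end
sum_diff = max(abs(Y(:) - Ysum(:)));
```

For this system the smallest eigenvalue is zero after one measurement and positive after two, i.e. two measurements are needed before information on both states is available.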


Example 8.13 The information filter 

The results obtained by the information filter are shown in Figure 
8.10. The negative eigenvalues of $C(i|i) = Y^{-1}(i|i)$ indicate that the
filter is not robust with respect to round-off errors.

Figure 8.10 Results from the information filter


8.3.4 Square root filtering


The square root of a square matrix P is a matrix A such that P = AA.
Sometimes the matrix B that satisfies $P = B^TB$ is also called a square
root, but strictly speaking such a B is a Cholesky factor, and not a square
root. Anyway, square roots, Cholesky factors and other factorizations
are useful matrix decomposition methods that enable stable implementations
of the Kalman filter. The principal idea in square root filtering is
to decompose a covariance matrix P as $P = B^TB$ (or likewise), and to use
B as a representation of P. This effectively doubles the precision of the
number representation. The various factorization methods lead to 
various forms of square root filtering. 

This section describes one particular implementation of a square root 
filter, the Potter implementation (Potter and Stern, 1963). It uses: 


• Triangular Cholesky factorization of the error covariance matrices.

• Sequential processing of the measurements using symmetric elementary
matrices.

• QR factorization for the prediction.


The update in Potter’s square root filter 


We will represent the error covariance matrix C(i|j) by an upper triangular
matrix B(i|j) where:


$$C(i|j) = B^T(i|j)B(i|j) \qquad (8.39)$$


An upper triangular matrix is a matrix with all elements below the 
diagonal equal to zero, e.g. 


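$$B = \begin{bmatrix} b_{00} & b_{01} & b_{02} & b_{03} \\ 0 & b_{11} & b_{12} & b_{13} \\ 0 & 0 & b_{22} & b_{23} \\ 0 & 0 & 0 & b_{33} \end{bmatrix} \qquad (8.40)$$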


The update formula, as expressed in (8.27): 


$$C(i|i) = C(i|i-1) - C(i|i-1)H^T\left(HC(i|i-1)H^T + C_v\right)^{-1}HC(i|i-1)$$

turns into:

$$\begin{aligned}
B^T(i|i)B(i|i) &= B^T(i|i-1)B(i|i-1) - B^T(i|i-1)B(i|i-1)H^T\left(HB^T(i|i-1)B(i|i-1)H^T + C_v\right)^{-1}HB^T(i|i-1)B(i|i-1) \\
&= B^T(i|i-1)B(i|i-1) - B^T(i|i-1)M\left(M^TM + C_v\right)^{-1}M^TB(i|i-1) \\
&= B^T(i|i-1)\left(I - M\left(M^TM + C_v\right)^{-1}M^T\right)B(i|i-1)
\end{aligned} \qquad (8.41)$$

where $M \stackrel{\text{def}}{=} B(i|i-1)H^T$ is an M × N matrix.
If we would succeed in finding a Cholesky factor of:

$$I - M\left(M^TM + C_v\right)^{-1}M^T \qquad (8.42)$$


then we would have an update formula entirely expressed in Cholesky 
factors. Unfortunately, in the general case it is difficult to find such a 
factor. 

For the special case of only one measurement, N = 1, a factorization 
is within reach. If N = 1, then H becomes a row vector h. The matrix M
becomes an M dimensional (column) vector $m = B(i|i-1)h^T$. Substitution
in (8.42) yields:


$$I - m\left(m^Tm + \sigma_v^2\right)^{-1}m^T \qquad (8.43)$$


$\sigma_v^2$ is the variance of the measurement noise.

Expression (8.43) is of the form $I - \alpha mm^T$ with $\alpha = 1/(\|m\|^2 + \sigma_v^2)$.
Such a form is called a symmetric elementary matrix. The form can be 
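easily factored by:

$$I - \alpha mm^T = (I - \beta mm^T)(I - \beta mm^T) \quad \text{with} \quad \beta = \frac{1}{\|m\|^2}\left(1 + \sqrt{\frac{\sigma_v^2}{\|m\|^2 + \sigma_v^2}}\right) \qquad (8.44)$$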

Substitution of (8.44) in (8.41) yields: 


$$B^T(i|i)B(i|i) = B^T(i|i-1)(I - \beta mm^T)(I - \beta mm^T)B(i|i-1) \qquad (8.45)$$

This gives us finally the update formula in terms of Cholesky factors:

$$B(i|i) = (I - \beta mm^T)B(i|i-1) \quad \text{with} \quad \beta = \frac{1}{\|m\|^2}\left(1 + \sqrt{\frac{\sigma_v^2}{\|m\|^2 + \sigma_v^2}}\right) \qquad \text{(square root filtering update)} \qquad (8.46)$$





The general solution, when more measurements are available, is 
obtained by sequentially processing them in exactly the same way as 
discussed in Section 8.3.2. 
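
A quick numerical check of (8.46) for a single measurement is sketched below. The covariance matrix, the row vector h and the noise variance are made-up illustration values; the check compares the result with the conventional covariance update (8.27).

```matlab
% Sketch: Potter's update (8.46) for one scalar measurement (N = 1).
% C, h and s2 are hypothetical illustration values.
C  = [4 1; 1 3];              % prediction covariance C(i|i-1)
h  = [1 0.5];                 % measurement row vector
s2 = 0.25;                    % measurement noise variance
M  = size(C, 1);

B = chol(C);                  % upper triangular factor, C = B'*B
m = B * h';                   % m = B(i|i-1)*h'
S = m'*m + s2;                % innovation variance h*C*h' + s2
beta = (1 + sqrt(s2/S)) / (m'*m);
Bnew = (eye(M) - beta*(m*m')) * B;      % (8.46)

C_potter = Bnew' * Bnew;                % covariance implied by the factor
C_conv   = C - (C*h') * (h*C) / S;      % conventional update (8.27)
err = max(abs(C_potter(:) - C_conv(:)));
```

Note that Bnew is a valid factor of C(i|i), but it is in general no longer upper triangular.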


The prediction in Potter’s square root filter 


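The prediction step $C(i+1|i) = FC(i|i)F^T + C_w$ can be written as a
product of Cholesky factors:

$$C(i+1|i) = FC(i|i)F^T + C_w = \begin{bmatrix} (C_w^{1/2})^T & FB^T(i|i) \end{bmatrix} \begin{bmatrix} C_w^{1/2} \\ B(i|i)F^T \end{bmatrix} \qquad (8.47)$$

where $C_w^{1/2}$ is the square root of $C_w$, in the sense that $C_w = (C_w^{1/2})^TC_w^{1/2}$. Thus, if we define the 2M × M matrix

$$A \stackrel{\text{def}}{=} \begin{bmatrix} C_w^{1/2} \\ B(i|i)F^T \end{bmatrix}$$

then the prediction covariance matrix is found as $C(i+1|i) = A^TA$.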

A QR factorization of a matrix A (not necessarily square) produces an 
orthonormal matrix Q and an upper triangular matrix R such that A = QR. 
The matrices Q and R have compatible dimensions. Such a factorization is 
what we are looking for because, since Q is orthonormal ($Q^TQ = I$), we
have: $A^TA = R^TR$. The procedure to get B(i+1|i) simply boils down to
constructing the matrix A, and then performing a QR factorization. 
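
The prediction step can be verified in isolation with the following sketch. The matrices are made-up illustration values, and chol() is used here because both covariance matrices in this small example are positive definite.

```matlab
% Sketch: square root prediction by QR factorization.
% F, C and Cw are hypothetical illustration values.
F  = [0.9 0.2; -0.1 0.8];
C  = [2 0.5; 0.5 1];          % C(i|i)
Cw = [0.1 0; 0 0.3];          % process noise covariance
M  = size(F, 1);

B    = chol(C);               % C  = B'*B
SqCw = chol(Cw);              % Cw = SqCw'*SqCw
A = [SqCw; B*F'];             % 2M x M block matrix, A'*A = F*C*F' + Cw
[Q, R] = qr(A);               % A = Q*R with R upper triangular
Bpred = R(1:M, :);            % only the first M rows are nonzero

Cpred = Bpred' * Bpred;
err = max(max(abs(Cpred - (F*C*F' + Cw))));
```

The factor Bpred is upper triangular again, so the triangular structure that is lost in the update is restored by the prediction.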

An implementation of Potter’s square root filter is given in Listing 8.4. 
MATLAB provides two functions for a triangular Cholesky factorization. 
We need such a function only at the initialization of the filter to get a 
factorization of C,(0) and Cy. The function chol() is applicable to 
positive definite matrices. In our case, both matrices can have zero 
eigenvalues. The cholinc() can also handle positive semidefinite 
matrices, but is only applicable to sparse matrices. The functions 
sparse() and full() take care of the conversions. Note that cholinc()
returns a matrix whose number of rows, in principle, equals the
number of nonzero eigenvalues. Thus, possibly, the matrix should be
filled up with zeros to obtain an M x M matrix.
The QR factorization in the prediction step is taken care of by qr().
The size of the resulting matrix is (M + K) x M where K is the number of 
rows in SqCw. Only the first M rows carry the needed information and 
the remaining rows are deleted. 

The update of the algorithm requires $N(M^3 + 3M^2 + M)$ operations.
The number of operations of the prediction is determined partly by qr (). 
This function requires about 1 M° + M?K operations. In full, the predic- 
tion requires 3M? + M?K operations. 


Listing 8.4
Potter’s implementation in MATLAB.


load linsys                          % Load a system
[B,p]=cholinc(sparse(Cx0),'0');      % Initialize the square root of the
B=full(B); B(p:M,:)=0;               % prior uncertainty
x_est=x0;                            % and the prior mean
[SqCw,p]=cholinc(sparse(Cw),'0');    % Square root of Cw
SqCw=full(SqCw);
while (1)                            % Endless loop
  z=acquire_measurement_vector();
  % 1. Sequential update:
  for n=1:N                          % For all measurements...
    m=B*H(n,:)';                     % get row vector from H
    norm_m=m'*m;
    S=norm_m+Cv(n,n);                % innovation variance
    K=B'*m/S;                        % Kalman gain vector
    inno=z(n)-H(n,:)*x_est;          % innovation
    x_est=x_est+K*inno;              % update estimate
    beta=(1+sqrt(Cv(n,n)/S))/norm_m;
    B=(eye(M)-beta*m*m')*B;          % covariance update
  end
  % 2. Prediction:
  u=get_control_vector();
  x_est=F*x_est+L*u;                 % Predict the state
  A=[SqCw; B*F'];                    % Create block matrix
  [q,B]=qr(A);                       % QR factorization
  B=B(1:M,:);                        % Delete irrelevant part
end

Example 8.14 Square root filtering 

The results obtained by Potter’s square root filter are shown in 
Figure 8.11. The double logarithmic plot of the minimum eigenvalue
of P = C(i|i) shows that the eigenvalue is of the order $O(i^{-1})$.
This is exactly according to the expectations since this eigenvalue
relates to a state without process noise. Thus, the variance of the
estimation error should be inversely proportional to the number of
observations.


8.3.5 Comparison 


In the preceding sections, five different implementations are discussed 
which are all mathematically equivalent. However, different implemen- 
tations have different sensitivities to round-off errors and different com- 
putational cost. 

Table 8.1 provides an overview of the cost expressed in the number 
of operations required for a single iteration. The table assumes a time 
invariant system so that terms like $H^TC_v^{-1}H$ can be reused. Furthermore,
the numbers are based on a straightforward MATLAB implementation with- 
out optimization with respect to computational cost. Special code that 
exploits the symmetry of covariance matrices can lower the number of 
operations a little. The computational efficiency of the square root filter 
can be improved by consistently maintaining the triangular structure of the 
matrices. (In the current implementation, the triangular structure is lost 
during the update, but is regained by the QR factorization.) 

A quantitative, general analysis of the sensitivities of the various 
implementations to round-off errors is difficult. However, Table 8.1 
gives an indication. The table shows the results of an experiment that 
relates to the system described in Example 8.10. For each implementation, 


Table 8.1 Comparison of different implementations








Computational Computational Required 
cost cost no. of 
update prediction digits 
Conventional 2M?N + 3MN? + N? 2M? 12 
Kalman filter 
MMSE form 2M? + M*+M 2M3 13 
Sequential 2M?N + 2MN 2M? 12 
processing 
of measurements 
Information filter M? +5M*+5M 2M? + 2M?K + 2MK? + K? 11 
Potter’s square N(M? + 3M? + M) 3 M? + M?°K 5 
root filter 





M = number of states. 
N = number of measurements. 
K = effective dimension of process noise vector. 
Figure 8.11 Results from Potter’s square root filter


the number of digits needed for the number representations of the vari- 
ables (including intermediate results) was established, in order to have a 
stable and consistent result. 

As expected, the square root filter is the most numerically stable
method, but it is also the most expensive one. Square root filtering
should be considered:


• If other implementations result in covariance matrices with negative
eigenvalues.

• If other implementations involve matrix inversions where the
inverse condition number, i.e. $\lambda_{min}/\lambda_{max}$, of the matrix is of the
same order of magnitude as the round-off errors.


The MMSE form is inexpensive if the number of measurements is large
relative to the number of states, i.e. if N >> M. The sequential
processing method is inexpensive, especially when both N and M are
large.


8.4 CONSISTENCY CHECKS 


The purpose of this section is to provide some tools that enable the 
designer to check whether his design behaves consistently (Bar-Shalom 
and Birmiwal, 1983). As discussed in the previous sections, the two main 
reasons why a realized filter does not behave correctly are modelling
errors and numerical instabilities of the filter.
Estimators are associated with three types of error variances:


• The minimal variances that would be obtained with the most
appropriate model.

• The actual variances of the estimation errors of some given estimator.

• The variances indicated by the calculated error covariance matrix
of a given estimator.


Of course, the purpose is to find an estimator whose error variances 
equal the first one. Since we do not know whether our model approaches 
the most appropriate one, we do not have the guarantee that our design 
approaches the minimal attainable variance. However, if we have 
reached the optimal solution, then the actual variances of the estimation 
errors must coincide with the calculated variances. Such a correspondence 
between actual and calculated variances is a necessary condition for an 
optimal filter, but not a sufficient one. 

Unfortunately, we need to know the real estimation errors in order to 
check whether the two variances coincide. This is only possible if the 
physical process is (temporarily) provided with additional instruments that 
give us reference values of the states. Usually such provisions are costly. This 
section discusses checks that can be accomplished without reference values. 


8.4.1 Orthogonality properties 


We recall that the update step of the Kalman filter is formed by the 
following operations (8.2): 


$$\begin{aligned}
\hat{z}(i) &= H(i)\bar{x}(i|i-1) && \text{(predicted measurement)} \\
\tilde{z}(i) &= z(i) - \hat{z}(i) && \text{(innovations)} \\
\bar{x}(i|i) &= \bar{x}(i|i-1) + K(i)\tilde{z}(i) && \text{(updated estimate)}
\end{aligned} \qquad (8.48)$$

The vectors $\tilde{z}(i)$ are the innovations (residuals). In linear-Gaussian systems,
these vectors are zero mean with the innovation matrix as covariance
matrix:


$$S(i) = H(i)C(i|i-1)H^T(i) + C_v(i) \qquad (8.49)$$
Furthermore, let $e(i|j) \stackrel{\text{def}}{=} x(i) - \bar{x}(i|j)$ be the estimation error of the estimate
$\bar{x}(i|j)$. We then have the following properties:


$$\begin{aligned}
E[e(i|j)z^T(m)] &= 0 && m \le j \\
E[e(i|j)\bar{x}^T(n|m)] &= 0 && m \le j \\
E[\tilde{z}(i)\tilde{z}^T(j)] &= \delta(i,j)S(i) && \text{i.e. } \tilde{z}(i) \text{ is white}
\end{aligned} \qquad (8.50)$$


These properties follow from the principle of orthogonality. In the static 
case, any unbiased linear MMSE satisfies: 


$$E[e(z - \bar{z})^T] = E[(x - Kz - (\bar{x} - K\bar{z}))(z - \bar{z})^T] = C_{xz} - KC_z = 0 \qquad (8.51)$$


Here, $\bar{x}$ and $\bar{z}$ denote the expectations of x and z. The last step follows
from $K = C_{xz}C_z^{-1}$; see (3.29). The principle of
orthogonality, $E[e(z - \bar{z})^T] = 0$, simply states that the covariance
between any component of the error and any measurement is zero.
Adopting an inner product definition for two random variables $e_m$ and
$z_n$ as $(e_m, z_n) \stackrel{\text{def}}{=} \mathrm{Cov}[e_m, z_n]$, the principle can be expressed as $e \perp z$.

Since $\bar{x}(i|j)$ is an unbiased linear MMSE estimate, it is a linear function
of the set $\{\bar{x}(0), Z(j)\} = \{\bar{x}(0), z(0), z(1), \ldots, z(j)\}$. According to the
principle of orthogonality, we have $e(i|j) \perp Z(j)$. Therefore, e(i|j) must also be
orthogonal to any z(m), $m \le j$, because z(m) lies in the subspace spanned by Z(j). This
proves the first statement in (8.50).

In addition, $\bar{x}(n|m)$, $m \le j$, is a linear combination of Z(m). Therefore,
it is also orthogonal to e(i|j), which proves the second statement.

The whiteness property of the innovations follows from the following 
argument. Suppose j < i. We may write: $E[\tilde{z}(i)\tilde{z}^T(j)] = E[E[\tilde{z}(i)\tilde{z}^T(j)|Z(j)]]$.
In the inner expectation, the measurements Z(j) are known.
Since $\tilde{z}(j)$ is a linear combination of Z(j), $\tilde{z}(j)$ is non-random. It can be
taken outside the inner expectation: $E[\tilde{z}(i)\tilde{z}^T(j)] = E[E[\tilde{z}(i)|Z(j)]\tilde{z}^T(j)]$.
However, $E[\tilde{z}(i)|Z(j)]$ must be zero because the predicted measurements
are unbiased estimates of the true measurements. If $E[\tilde{z}(i)|Z(j)] = 0$, then
$E[\tilde{z}(i)\tilde{z}^T(j)] = 0$ (unless i = j).


8.4.2 Normalized errors 


The NEES (normalized estimation error squared) is a test signal defined as: 


$$Nees(i) = e^T(i|i)C^{-1}(i|i)e(i|i) \qquad (8.52)$$

In the linear-Gaussian case, the NEES has a $\chi^2_M$ distribution (chi-square
with M degrees of freedom; see Appendix C.1.5).

The $\chi^2_M$ distribution of the NEES follows from the following
argument. Since the state estimator is unbiased, E[e(i|i)] = 0. The
covariance matrix of e(i|i) is C(i|i). Suppose A(i) is a symmetric
matrix such that $A(i)A^T(i) = C^{-1}(i|i)$. Application of A(i) to e(i|i) will
give a random vector y(i) = A(i)e(i|i). The covariance matrix of y(i) is:
$A(i)C(i|i)A^T(i) = A(i)(A(i)A^T(i))^{-1}A^T(i) = I$. Thus, the components of
y(i) are uncorrelated, and have unit variance. If both the process
noise and the measurement noise are normally distributed, then so
is y(i). Hence, the inner product $y^T(i)y(i) = e^T(i|i)C^{-1}(i|i)e(i|i)$ is the sum
of M squared, independent random variables, each normally
distributed with zero mean and unit variance. Such a sum has a $\chi^2_M$
distribution.

The NIS (normalized innovation squared) is a test signal defined as: 


$$Nis(i) = \tilde{z}^T(i)S^{-1}(i)\tilde{z}(i) \qquad (8.53)$$


In the linear-Gaussian case, the NIS has a $\chi^2_N$ distribution. This follows
readily from the same argument as used above. 
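
Both test signals are simple quadratic forms. The sketch below evaluates (8.52) and (8.53) for one time instant; the error vector, the innovation and the covariance matrices are made-up numbers that only serve as an illustration.

```matlab
% Sketch: NEES (8.52) and NIS (8.53) at a single time instant.
% All numbers are hypothetical illustration values.
e = [0.3; -0.2];              % estimation error e(i|i), M = 2
C = [0.5 0.1; 0.1 0.2];       % error covariance matrix C(i|i)
ztilde = 1.2;                 % innovation, N = 1
S = 2.0;                      % innovation matrix S(i)

nees = e' * (C \ e);          % e' * inv(C) * e, without forming inv(C)
nis  = ztilde' * (S \ ztilde);
```

The backslash operator solves the linear system directly, which is usually preferable to an explicit inversion of C or S.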


Example 8.15 Normalized errors of second order system 
Consider the following system: 


_[ 0.999cos(0.1n) 0.999sin(0.17)] 1 _ 14 os 
Sa ene = Le oal 
0 
C= |, o G, = [1] (8.54) 
Soe eu EKO = |] 


Figure 8.12 shows the results of a MATLAB realization of this system 
consisting of states and measurements. Application of the discrete 
Kalman filter yields estimated states and innovations. From that, the 
NEES and the NIS are calculated. 

In this case, M = 2 and N = 1. Thus, the NEES and the NIS should 
obey the statistics of a $\chi^2_2$ and a $\chi^2_1$ distribution. The 95% percentiles
of these distributions are 5.99 and 3.84, respectively. Thus, about 
95% of the samples should be below these percentiles, and about 5% 
above. Figure 8.12 affirms this. 
Figure 8.12 Innovations and normalized errors of a state estimator for a second
order system 


8.4.3 Consistency checks 


From equation (8.50) and the properties of the NEES and the NIS, the 
following statement holds true: 
If a state estimator for a linear-Gaussian system is optimal, then: 


• The sequence Nees(i) must be $\chi^2_M$ distributed.
• The sequence Nis(i) must be $\chi^2_N$ distributed.
• The sequence $\tilde{z}(i)$ (innovations) must be white.


Consistency checks of a ‘state estimator-under-test’ can be performed by
collecting the three sequences and by applying statistical tests to see 
whether the three conditions are fulfilled. If one or more of these condi- 
tions are not satisfied, then we may conclude that the estimator is not 
optimal. 

In a real design, the NEES test is only possible if the true states are 
known. As mentioned above, such is the case when the system is 
(temporarily) provided with extra measurement equipment that measures
the states directly and with sufficient accuracy so that their outputs
can be used as a reference. Usually, such measurement devices are too 
expensive and the designer has to rely on the other two tests. However, 
the NEES test is applicable in simulations of the physical process. It can 
be used to see if the design is very sensitive to changes in the parameters 
of the model. The other two tests use the innovations. Since the innova- 
tions are always available, even in a real design, these are of more 
practical significance. 

In order to check the hypothesis that a data set obeys a given distribu- 
tion, we have to apply a distribution test. Examples are the 
Kolmogorov—Smirnov test and the Chi-square test (Section 9.3.3). Often, 
instead of these rigorous tests a quick procedure suffices to give an 
indication. We simply determine an interval in which, say, 95% of the 
samples must be. Such an interval [A,B] is defined by 
$F(B) - F(A) = 0.95$. Here, $F(\cdot)$ is the (cumulative) probability distribution
function of the random variable under test. If an appreciable part of the data set is
outside this interval, then the hypothesis is rejected. For chi-square
distributions there is the one-sided 95% acceptance boundary with
A = 0, and B such that $F_{\chi^2_{Dof}}(B) = 0.95$. Sometimes the two-sided 95%
acceptance boundaries are used, defined by $F_{\chi^2_{Dof}}(A) = 0.025$ and
$F_{\chi^2_{Dof}}(B) = 0.975$. Table 8.2 gives values of A and B for various degrees
of freedom (Dof).
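
The quick procedure amounts to counting the samples that fall inside the acceptance interval. The following sketch does so for a short, made-up innovation sequence with N = 1; the boundary is the one-sided value for one degree of freedom from Table 8.2.

```matlab
% Sketch: quick acceptance test on the NIS with the one-sided boundary.
% The innovation sequence is a hypothetical illustration.
ztilde = [0.5 -1.1 2.3 0.2 -0.7 1.6 -0.1 0.9 -2.1 0.4];
S = 1.0;                      % constant innovation variance (steady state)
nis = ztilde.^2 / S;          % (8.53) for N = 1

B_one = 3.841;                % Table 8.2, Dof = 1, one-sided
frac_inside = mean(nis <= B_one);   % should be about 0.95
```

With only ten samples the fraction is of course a very rough indication; in practice a long record is needed before an 'appreciable part' outside the interval can be judged.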

For state estimators that have reached the steady state, S(i) is constant,
and the whiteness property of the innovations implies that the power
spectrum of any element of $\tilde{z}(i)$ must be flat.² Suppose that $\tilde{z}_n(i)$ is an
element of $\tilde{z}(i)$. The variance of $\tilde{z}_n(i)$, denoted by $\sigma_n^2$, is the n-th diagonal
element in S. The discrete Fourier transform of $\tilde{z}_n(i)$, calculated over
i = 0, ..., I − 1, is:


$$\tilde{Z}_n(k) = \sum_{i=0}^{I-1}\tilde{z}_n(i)e^{-\mathrm{j}2\pi ki/I} \qquad k = 0,\ldots,I-1 \qquad \mathrm{j} = \sqrt{-1} \qquad (8.55)$$


The periodogram of $\tilde{z}_n(i)$, defined here as $P_n(k) = |\tilde{Z}_n(k)|^2/I$, is an estimate
of the power spectrum (Blackman and Tukey, 1958). If the power
spectrum is flat, then $E[P_n(k)] = \sigma_n^2$. It can be proven that in that case the





² For estimators that are not in the steady state, the innovations have to be premultiplied by
$S^{-1/2}(i)$ so that the covariance matrix of the resulting sequence becomes constant.

Table 8.2 95% Acceptance boundaries for $\chi^2_{Dof}$ distributions





One-sided Two-sided 
Dof A B A B 
1 0 3.841 0.0010 5.024 
2 0 5.991 0.0506 7.378 
3 0 7.815 0.2158 9.348 
4 0 9.488 0.4844 11.14 
5 0 11.07 0.8312 12.83 


variables $2P_n(k)/\sigma_n^2$ are $\chi^2_2$ distributed for all k (except for k = 0). Hence,
the whiteness of $\tilde{z}_n(i)$ is tested by checking whether $2P_n(k)/\sigma_n^2$ is $\chi^2_2$
distributed.
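
The normalized periodogram is readily obtained with fft(). The sketch below applies it to a deliberately non-white test sequence (a pure tone), chosen by us only to show how a peak reveals a violation of the whiteness property; the sequence length and the tone frequency are arbitrary assumptions.

```matlab
% Sketch: normalized periodogram as a whiteness check.
% The 'innovation' sequence is a hypothetical, deliberately non-white tone.
I = 64;                               % number of samples
n = 0:I-1;
ztilde = cos(2*pi*6*n/I);             % pure tone at frequency index k = 6

sigma2 = mean(ztilde.^2);             % estimated variance
Z = fft(ztilde);                      % DFT as in (8.55)
P = abs(Z).^2 / I;                    % periodogram
Pnorm = 2*P / sigma2;                 % should be chi-square (2 dof) if white

% Search k = 1..I/2 (element k+1 of Pnorm holds frequency index k)
[peak, k_peak] = max(Pnorm(2:I/2+1));
exceeds = peak > 5.991;               % one-sided 95% boundary, Dof = 2
```

For a white sequence only about 5% of the values of Pnorm should exceed 5.991; here the value at k = 6 exceeds it by an order of magnitude.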


Example 8.16 Consistency checks applied to a second order system 
The results of the estimator discussed in Example 8.15 and presented 
in Figure 8.12 pass the consistency checks successfully. Both the 
NEES and the NIS are about 95% of the time below the one-sided 
acceptance boundaries, i.e. below 5.99 and 3.84. The figure also 
shows the normalized periodogram calculated as $2P_n(k)/\hat{\sigma}_n^2$ with $\hat{\sigma}_n^2$
the estimated variance of the innovation. The normalized periodogram
shown seems to comply with the theoretical $\chi^2_2$ distribution.


Example 8.17 Consistency checks applied to a slightly mismatched 
filter 

Figure 8.13 shows the results of a state estimator that is applied to the 
same data as used in Example 8.15. However, the model the estimator 
uses differs slightly. The real system matrix F of the generating pro- 
cess and the system matrix Fy., on which the design of the state 
estimator is based are as follows: 


0.999 cos(0.17) 0.999 sin(0.17) 
~ ee nae 

p _ | 0999cos(0.1167) 0.999 sin(0.1167) 
G Eoo | 


Apart from that, the model used by the state estimator exactly 
matches the real system. 

In this example, the design does not pass the whiteness test of 
the innovations. The peak of the periodogram at k = 6 is above 20. 
For a $\chi^2_2$ distribution such a high value is unlikely to occur. (In fact,
the chance is smaller than 1 in 20 000.)


Figure 8.13 Innovations and normalized errors of a state estimator based on a
slightly mismatched model


8.4.4 Fudging 


If one or more of the consistency checks fail, then somewhere a serious 
modelling error has occurred. The designer has to step back to an earlier 
stage of the design in order to identify the fault. The problem can be 
caused anywhere, from inaccurate modelling during the system identifi- 
cation to improper implementations. If the system is nonlinear and the 
extended Kalman filter is applied, problems may arise due to the neglect 
of higher order terms of the Taylor series expansion. 

A heuristic method to catch the errors that arise due to approxima- 
tions of the model is to deliberately increase the modelled process noise 
(Bar-Shalom and Li, 1993). One way to do so is by increasing the 
covariance matrix $C_w$ of the process noise by means of a fudge factor $\gamma$.
For instance, we simply replace $C_w$ by $C_w + \gamma I$. Other methods to
regulate a covariance matrix are discussed in Section 5.2.3. Instead of
adapting $C_w$ we can also increase the diagonal of the prediction covariance
C(i + 1|i) by some factor. The fudge factor should be selected such
that the consistency checks now pass as successfully as possible.

Fudging effectuates a decrease of faith in the model of the process, and thus 
causes a larger impact of the measurements. However, modelling errors give 
rise to deviations that are autocorrelated and not independent from the 
states. Thus, these deviations are not accurately modelled by white noise. 
The designer should maintain a critical attitude with respect to fudging. 


8.5 EXTENSIONS OF THE KALMAN FILTER 


The extensions considered in this section make the Kalman filter applic- 
able to a wider class of problems. In particular, we discuss extensions to 
cover non-white and cross-correlated noise sequences. Also, the topic of 
offline estimation will be introduced. 


8.5.1 Autocorrelated noise 


The Kalman filter considered so far assumes white uncorrelated random 
sequences w(i) and v(i) for the process and measurement noise. What do 
we do if these assumptions do not hold in practice? 

The case of autocorrelated noise is usually tackled by assuming a state 
space model for the noise. For instance,³ autocorrelated process noise is
represented by $w(i+1) = F_ww(i) + \tilde{w}(i)$ where $\tilde{w}(i)$ is a white noise
sequence with covariance matrix $C_{\tilde{w}}$. State augmentation reduces the
problem to a standard form. For that, the state vector is extended by w(i):


$$\begin{bmatrix} x(i+1) \\ w(i+1) \end{bmatrix} = \begin{bmatrix} F & I \\ 0 & F_w \end{bmatrix}\begin{bmatrix} x(i) \\ w(i) \end{bmatrix} + \begin{bmatrix} 0 \\ \tilde{w}(i) \end{bmatrix} \qquad z(i) = \begin{bmatrix} H & 0 \end{bmatrix}\begin{bmatrix} x(i) \\ w(i) \end{bmatrix} + v(i) \qquad (8.56)$$


³ For convenience of notation the time index of matrices (for time variant systems) is omitted in
this section.

The new, augmented state vector can be estimated using the standard
Kalman filter. 
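
In MATLAB, the augmented model is built by simple block concatenation. The sketch below follows the structure of (8.56); the numerical values of F, H, Fw and the covariance of the white driving noise are made up, and the names Fa, Ha and Cwa are our own.

```matlab
% Sketch: state augmentation for autocorrelated process noise.
% F, H, Fw and Cwt are hypothetical illustration values.
F   = [0.95 0.1; 0 0.9];      % original transition matrix
H   = [1 0];                  % original measurement matrix
Fw  = 0.8*eye(2);             % noise model: w(i+1) = Fw*w(i) + wt(i)
Cwt = 0.01*eye(2);            % covariance of the white sequence wt(i)
M   = size(F, 1);

Fa  = [F eye(M); zeros(M) Fw];          % augmented transition matrix
Ha  = [H zeros(size(H,1), M)];          % augmented measurement matrix
Cwa = blkdiag(zeros(M), Cwt);           % augmented process noise covariance
```

The matrices Fa, Ha and Cwa can be passed to any of the implementations discussed in Section 8.3. Note that Cwa is singular, so a factorization that can handle positive semidefinite matrices is needed if the square root filter is used.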

The case of autocorrelated measurement noise is handled in the same 
way although some complications may occur (Bryson and Hendrikson, 
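1965). Suppose that the measurement noise is modelled by
$v(i+1) = F_vv(i) + \tilde{v}(i)$ with $\tilde{v}(i)$ a white noise sequence with covariance
$C_{\tilde{v}}$. State augmentation brings the following model:


$$\begin{bmatrix} x(i+1) \\ v(i+1) \end{bmatrix} = \begin{bmatrix} F & 0 \\ 0 & F_v \end{bmatrix}\begin{bmatrix} x(i) \\ v(i) \end{bmatrix} + \begin{bmatrix} I \\ 0 \end{bmatrix}w(i) + \begin{bmatrix} 0 \\ I \end{bmatrix}\tilde{v}(i) \qquad z(i) = \begin{bmatrix} H & I \end{bmatrix}\begin{bmatrix} x(i) \\ v(i) \end{bmatrix} \qquad (8.57)$$


The process noise of the new system consists of two terms. Taken
together, the corresponding covariance matrix is $[I \;\; 0]^TC_w[I \;\; 0] + [0 \;\; I]^TC_{\tilde{v}}[0 \;\; I]$.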

In the new, augmented system there is no measurement noise. The 
corresponding covariance matrix is zero. This is not necessarily a pro- 
blem because the standard Kalman filter does not use the inverse of the 
measurement covariance matrix. However, the computation of the Kalman
gain matrix does require the inverse of $HC(i|i-1)H^T + C_v$. If in
this expression $C_v = 0$, the feasibility of calculating K depends on the
invertibility of $HC(i|i-1)H^T$. The statement $C_v = 0$ implies that some
linear combinations of the state vector are known without any uncertainty.
If in this subspace of the state space no process noise is active,
ultimately $HC(i|i-1)H^T$ becomes near singular, and the filter becomes
unstable. The solution for this potential problem is to apply differencing.
Instead of using z(i) directly, we use the differenced measurements
$y(i) = z(i) - F_vz(i-1) = Hx(i) - F_vHx(i-1) + \tilde{v}(i-1)$.


Example 8.18 Suppression of 50 Hz emf interference 
Consider a signal x(t) that is disturbed by a 50 (Hz) emf interference 
caused, for instance, by an inductive crosstalk of an electric power 
line. The bandwidth of the signal is B = 10 (rad/s), and the sampling
period is $\Delta = 1$ (ms). We model the signal by a first order system
$\dot{x} = -Bx + w$ which in discrete time becomes $x(i+1) = (1 - B\Delta)x(i) + w(i)$.

An inductive coupling with a power line induces an interfering
periodic waveform with a fundamental frequency of $f_0 = 50$ (Hz). Usually,
such a waveform also contains a component with double frequency
(and possibly higher harmonics, but these will be neglected here). The
disturbance can be modelled by two second order equations, that is:


$$v(i+1) = d\begin{bmatrix} \cos(2\pi f_0\Delta) & \sin(2\pi f_0\Delta) & 0 & 0 \\ -\sin(2\pi f_0\Delta) & \cos(2\pi f_0\Delta) & 0 & 0 \\ 0 & 0 & \cos(4\pi f_0\Delta) & \sin(4\pi f_0\Delta) \\ 0 & 0 & -\sin(4\pi f_0\Delta) & \cos(4\pi f_0\Delta) \end{bmatrix}v(i) + \tilde{v}(i) \qquad (8.58)$$


The factor d is selected close to one, modelling the fact that the 
magnitudes of each component vary in time only slowly. 
Application of the augmented state estimator to a simulation of 
the process shows results as depicted in Figure 8.14. The Bode 
diagram clearly shows that the state estimator acts as a double-notch 
filter. The width of the notch depends on the choice of d. The Bode 
diagram, valid for the steady state Kalman filter, is obtained with the 
following MATLAB fragment (making use of the Control System
Toolbox):


sys=ss(Fn,eye(5),H,0,Ts);           % Create a state space model
[Kest,L,P,M,Z]=kalman(sys,Wn,0);    % Find the steady state KF
bodemag(Kest(2,:),'k');             % Plot the Bode diagram


Figure 8.14 Suppression of 50 Hz emf interference based on Kalman filtering


8.5.2 Cross-correlated noise 


Another situation occurs when the process and measurement noise are 
cross-correlated, that is, $C_{wv} = E[w(i)v^T(i)] \neq 0$. Such might happen if
both the physical process and the measurement system are affected by the
same source of disturbance. Examples are changes of temperature and
electrical interference due to induction.

The strategy to bring this situation to the standard estimation problem 
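is to introduce a modified state equation by including a term $T(z(i) - Hx(i) - v(i)) = 0$:


$$\begin{aligned} x(i+1) &= Fx(i) + Lu(i) + w(i) + T(z(i) - Hx(i) - v(i)) \\ &= (F - TH)x(i) + Lu(i) + w(i) - Tv(i) + Tz(i) \end{aligned} \qquad (8.59)$$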





The factor F—TH is the modified transition matrix. The terms 
w(i) — Tv(i) are regarded as process noise. The term Tz(i) is known 
and can be regarded as a control input. 

Note that T can be selected arbitrarily because T(z(i) — Hx(i) —v(i)) = 0. 
Therefore, we can select T such that the new process noise becomes 
uncorrelated with respect to the measurement noise. That is, 
$E[(w(i) - Tv(i))v^T(i)] = 0$. Or: $C_{wv} - TC_v = 0$. In other words, if
$T = C_{wv}C_v^{-1}$, the modified state equation is in the standard form without
cross-correlation. 
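
The decorrelation can be checked numerically. In the sketch below the covariance matrices and the system matrices are made-up values (chosen such that the joint covariance matrix of w and v is positive definite); Fmod and Cw_mod are our own names for the modified transition matrix and the covariance matrix of the new process noise w(i) − Tv(i).

```matlab
% Sketch: removing cross-correlation between process and measurement noise.
% All matrices are hypothetical illustration values.
F   = [0.9 0.1; 0 0.8];
H   = [1 0];
Cw  = [1 0.2; 0.2 0.5];       % process noise covariance
Cv  = 0.4;                    % measurement noise variance
Cwv = [0.3; -0.1];            % cross-covariance E[w v']

T      = Cwv / Cv;            % T = Cwv * inv(Cv)
Fmod   = F - T*H;             % modified transition matrix
cross  = Cwv - T*Cv;          % E[(w - T v) v'], should vanish
Cw_mod = Cw - T*Cwv' - Cwv*T' + T*Cv*T';   % covariance of w - T v
```

With this choice of T the covariance of the new process noise reduces to $C_w - C_{wv}C_v^{-1}C_{wv}^T$, which is never larger than $C_w$.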





8.5.3 Smoothing 


Up to now only online estimation of continuous states has been con- 
sidered. The topic of prediction has been touched on in Section 4.2.1. Here, 
we introduce the subject of offline estimation, generally referred to as 
smoothing. Assuming a linear-Gaussian model of the process and 
measurements, we have recorded a set of measurements $Z_K = \{z(0), z(1), \ldots, z(K)\}$.
Using the prior knowledge E[x(0)] and $C_x(0)$ we want
estimates for some points in time $0 \le k \le K$. We emphasize once again
that this section is only introductory. For details we refer to the pertinent
literature (Gelb et al., 1974).

The problem is often divided into three types of problems: 


e Fixed interval smoothing: K is fixed. k is variable between 0 and K. 

e Fixed point smoothing: k is fixed. K increases with time, K = i. 

e Fixed lag smoothing: both k and K increase with time, but with 
K—k fixed to the so-called lag=K—k. Thus, K=i and 
k =i- lag. 


Fixed interval smoothing is needed most often. The problem occurs 
when an experiment has been done, the data has been acquired and 
stored, and the data is analysed afterwards. 

The general approach to smoothing is the same as for discrete states. 
See Section 4.3.3. The estimation occurs in two passes. In the first pass, 
the data is processed forward in time to yield estimates $\bar{x}_f(k) = \bar{x}(k|k)$. In
the second pass, the data is processed backward in time. Starting with
k = K we recursively estimate the previous states using only data from
the ‘future’. Thus, the backward estimate $\bar{x}_b(k)$ only uses information
from $Z_K - Z_k = \{z(k+1), \ldots, z(K)\}$. For that, a ‘reversed time’
Kalman filter can be used. The final estimate $\bar{x}(k|K)$ is obtained by
optimally combining $\bar{x}_f(k)$ and $\bar{x}_b(k)$.

Although this approach yields the desired optimally smoothed states, it is 
not computationally efficient. For each of the three different types of smooth- 
ing problems more efficient algorithms have been proposed. One of them is 
the well-known Rauch-Tung-Striebel smoother (Rauch et al., 1965). The
algorithm implements a fixed interval smoother. It does not explicitly use
$\bar{x}_b(k)$. Instead, it uses recursively $\bar{x}(k|K)$. The algorithm is as follows:


Algorithm 8.2: Rauch-Tung-Striebel smoother 


1. Apply the standard discrete Kalman filter to find the offline estimates
and store the results, that is, the estimates $\bar{x}(k|k)$, C(k|k), along with
the one-step-ahead predictions $\bar{x}(k+1|k)$, C(k+1|k).

2. For k = K − 1 looping back to k = 0 with step = −1:

2.1. $A = C(k|k)F^TC^{-1}(k+1|k)$
2.2. $\bar{x}(k|K) = \bar{x}(k|k) + A(\bar{x}(k+1|K) - \bar{x}(k+1|k))$
2.3. $C(k|K) = C(k|k) + A(C(k+1|K) - C(k+1|k))A^T$

Step 2.3 is only required if we are interested in the smoothing covariance
matrix. 
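
A minimal scalar sketch of Algorithm 8.2 is given below. It is not the code used for the examples in this chapter: the model parameters, the measurement record and the vague prior (a very large initial variance) are assumptions made for illustration only.

```matlab
% Sketch: scalar Rauch-Tung-Striebel smoother (Algorithm 8.2).
% Model x(i+1) = a*x(i) + w(i), z(i) = x(i) + v(i); all values hypothetical.
a = 0.99;  Cw = 1e-4;  Cv = 0.04;
K = 50;
k = 0:K;
z = a.^k + 0.2*sin(7*k);              % made-up measurement record
N = K + 1;                            % element n holds time k = n-1

xf = zeros(1,N); Cf = zeros(1,N);     % x(k|k),   C(k|k)
xp = zeros(1,N); Cp = zeros(1,N);     % x(k+1|k), C(k+1|k)
xpred = 0; Cpred = 1e6;               % vague prior knowledge

% Step 1: forward pass with the standard Kalman filter
for n = 1:N
    Kg    = Cpred / (Cpred + Cv);             % Kalman gain
    xf(n) = xpred + Kg*(z(n) - xpred);        % update
    Cf(n) = (1 - Kg)*Cpred;
    xp(n) = a*xf(n);                          % one-step-ahead prediction
    Cp(n) = a*Cf(n)*a + Cw;
    xpred = xp(n); Cpred = Cp(n);
end

% Step 2: backward pass, starting from x(K|K)
xs = xf; Cs = Cf;
for n = N-1:-1:1
    A     = Cf(n)*a / Cp(n);                  % step 2.1
    xs(n) = xf(n) + A*(xs(n+1) - xp(n));      % step 2.2
    Cs(n) = Cf(n) + A*(Cs(n+1) - Cp(n))*A;    % step 2.3
end
```

The smoothed variances Cs never exceed the filtered variances Cf, and the difference is largest at the beginning of the record where the filter has seen only a few measurements.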


Example 8.19 Estimation of a transient of an electrical RC circuit 

Figure 8.15 shows an electric circuit consisting of a capacitor con- 
nected by means of a switch to a resistor. The resistor represents the 
input impedance of an AD converter that measures the voltage z. The 
voltage across the capacitor is x. At t = 0 the switch closes, giving rise 
to a measured voltage z = x + v (v is regarded as sensor noise). The system 
obeys the following state equation $\dot{x} = -x/(RC)$ with RC = 10 (μs).

























Figure 8.15 The measurement of a transient in an electrical RC network 
Figure 8.16 Estimation of a transient by means of filtering and by smoothing

In discrete time this becomes $x(i+1) = (1 - \Delta/(RC))x(i)$. The
sampling period is $\Delta = 0.1$ (μs). K = 100. No prior knowledge about
x(0) is available.

Figure 8.16 shows observed measurements along with the corres- 
ponding estimated states and the result from the Rauch-Tung-Striebel 
smoother. Clearly, the uncertainty of the offline obtained estimate 
is much smaller than the uncertainty of the Kalman filtered result. 
This holds true especially in the beginning where the online filter has 
only a few measurements at its disposal. The offline estimator can 
take advantage of all measurements. 
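
The recursion of Algorithm 8.2 can be tried out on this RC example with a few lines of plain MATLAB. The following is only a sketch under stated assumptions: the sensor noise level `sigma_v`, the true initial voltage `x0_true` and the large prior variance that stands in for "no prior knowledge" are illustrative choices that do not come from the text.

```matlab
% Rauch-Tung-Striebel smoother for the scalar RC transient (sketch).
RC = 10;  Delta = 0.1;  K = 100;        % time constant and sampling period (us), horizon
F = 1 - Delta/RC;  H = 1;               % x(i+1) = F*x(i),  z(i) = H*x(i) + v(i)
Cw = 0;                                 % the state equation has no process noise
sigma_v = 0.1;  Cv = sigma_v^2;         % ASSUMED sensor noise level
x0_true = 1;                            % ASSUMED true initial voltage

% Simulate measurements z(0..K); MATLAB index k+1 holds time k
x_true = x0_true * F.^(0:K);
z = H*x_true + sigma_v*randn(1,K+1);

% Step 1: forward Kalman filter, storing estimates and predictions
xf = zeros(1,K+1);  Cf = zeros(1,K+1);  % x(k|k), C(k|k)
xp = zeros(1,K+1);  Cp = zeros(1,K+1);  % x(k|k-1), C(k|k-1)
xp(1) = 0;  Cp(1) = 1e6;                % ASSUMED vague prior for x(0)
for k = 1:K+1
    S  = H*Cp(k)*H' + Cv;               % innovation variance
    Kg = Cp(k)*H'/S;                    % Kalman gain
    xf(k) = xp(k) + Kg*(z(k) - H*xp(k));
    Cf(k) = Cp(k) - Kg*S*Kg';
    if k <= K                           % one-step-ahead prediction
        xp(k+1) = F*xf(k);
        Cp(k+1) = F*Cf(k)*F' + Cw;
    end
end

% Step 2: backward recursion from time K-1 down to time 0
xs = xf;  Cs = Cf;                      % initialised with x(K|K), C(K|K)
for k = K:-1:1
    A = Cf(k)*F'/Cp(k+1);                       % step 2.1
    xs(k) = xf(k) + A*(xs(k+1) - xp(k+1));      % step 2.2
    Cs(k) = Cf(k) + A*(Cs(k+1) - Cp(k+1))*A';   % step 2.3
end
```

Plotting `sqrt(Cf)` and `sqrt(Cs)` against time shows the effect described above: the smoothed variance never exceeds the filtered one, and the gap is largest near k = 0.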

8.7 EXERCISES


1. The observed sequence of data shown in Figure 8.17 is available in the data file 
C8exercise1.mat. Use MATLAB to determine the smallest order of an autoregressive
model that is still able to describe the data well. Determine the parameters of that model. (*). 

2. Given the system:

$$F = \begin{bmatrix} 0.32857 & -0.085714 \\ 1.1429 & 1.0714 \end{bmatrix} \qquad H = \begin{bmatrix} 10 & 5 \end{bmatrix}$$

Determine the observability Grammian and the observability matrix. What are the
eigenvalues of these matrices? What can be said about the observability? (0)

3. Given the system x(i+1) = Fx(i) + Gw(i) and z(i) = Hx(i) + v(i) with

$$F = \begin{bmatrix} 0.65 & -0.06 \\ -0.375 & 0.65 \end{bmatrix} \qquad G = \begin{bmatrix} 2 \\ 5 \end{bmatrix} \qquad H = \begin{bmatrix} 1 & 1 \end{bmatrix}$$

w(i) and v(i) are white noise sequences with unit variances. Examine the observability
and the controllability of this system. Does the steady state Kalman filter exist? If so,
determine the Kalman gain, the innovation matrix, the prediction covariance matrix
and the error covariance matrix. (*)


4. Repeat exercise 3, but this time with:

$$F = \begin{bmatrix} 0.85 & -0.14 \\ -0.875 & 0.85 \end{bmatrix}$$


5. Repeat exercise 3, but now with:

$$F = \begin{bmatrix} 0.75 & -0.1 \\ -0.625 & 0.75 \end{bmatrix}$$


Figure 8.17 Observed random sequence in exercise 1

Figure 8.18 Observed measurements from a drifting sensor


6. Explain the different results obtained in exercises 3, 4 and 5, by examining the 
eigenvalues of F in the different cases. (**) 

7. Determine the computational complexity of the information filter. (*) 

8. Drift in the measurements.
We consider a physical quantity x(i) that is sensed by a drifting sensor whose output is
modelled by $z(i) = x(i) + v(i)$ and $v(i+1) = \beta v(i) + \tilde{v}(i)$ with $\beta = 0.999$. $\tilde{v}(i)$ is a white
noise sequence with zero mean and variance $\sigma_{\tilde{v}}^2 = 0.002$. The physical quantity has a
limited bandwidth modelled by $x(i+1) = \alpha x(i) + w(i)$ with $\alpha = 0.95$. The process
noise is white and has a variance $\sigma_w^2 = 0.0975$. A record of the measurements is shown
in Figure 8.18. The data is available in the file C8exercise8.mat.

• Give a state space model of this system. (0)
• Examine the observability and controllability of this system. (0)
• Give the solution of the discrete Lyapunov equation. (0)
• Realize the discrete Kalman filter. Calculate and plot the estimates, their boundaries,
  the innovations and the periodogram of the innovations. (*)
• Compare the signal-to-noise ratios before and after filtering. (0)
• Perform the consistency checks. (*)

Worked Out Examples


In this final chapter, three worked out examples will be given of the 
topics discussed in this book: classification, parameter estimation and 
state estimation. They will take the form of a step-by-step analysis of 
data sets obtained from real-world applications. The examples demon- 
strate the techniques treated in the previous chapters. Furthermore, they 
are meant to illustrate the standard approach to solving these types of 
problems. Obviously, the MATLAB and PRTools algorithms as they were 
presented in the previous chapters will be used. The data sets used here 
are available through the website accompanying this book. 


9.1 BOSTON HOUSING CLASSIFICATION PROBLEM 
9.1.1 Data set description 


The Boston Housing data set is often used to benchmark data analysis 
tools. It was first described in Harrison and Rubinfeld (1978). This paper
investigates which features are linked to the air pollution in several areas in 
Boston. The data set can be downloaded from the UCI Machine Learning 
repository at http://www.ics.uci.edu/~mlearn/MLRepository.html. 

Each feature vector from the set contains 13 elements. Each feature 
element provides specific information on an aspect of an area of a 
suburb. Table 9.1 gives a short description of each feature element. 





Classification, Parameter Estimation and State Estimation: An Engineering Approach using MATLAB 
F. van der Heijden, R.P.W. Duin, D. de Ridder and D.M.J. Tax 
© 2004 John Wiley & Sons, Ltd ISBN: 0-470-09013-8 
Table 9.1 The features of the Boston Housing data set

No.  Feature name  Description                                                            Feature range and type
 1   CRIME         Crime rate per capita                                                  0-89, real valued
 2   LARGE         Proportion of area dedicated to lots larger than 25 000 square feet   0-100, real valued, but many have value 0
 3   INDUSTRY      Proportion of area dedicated to industry                               0.46-27.7, real valued
 4   RIVER         1 = borders the Charles river, 0 = does not border the Charles river   0-1, nominal values
 5   NOX           Nitric oxides concentration (in parts per 10 million)                  0.38-0.87, real valued
 6   ROOMS         Average number of rooms per house                                      3.5-8.8, real valued
 7   AGE           Proportion of houses built before 1940                                 2.9-100, real valued
 8   WORK          Average distance to employment centres in Boston                       1.1-12.1, real valued
 9   HIGHWAY       Highway accessibility index                                            1-24, discrete values
10   TAX           Property tax rate (per $10 000)                                        187-711, real valued
11   EDUCATION     Pupil-teacher ratio                                                    12.6-22.0, real valued
12   AA            $1000(A - 0.63)^2$ where A is the proportion of African-Americans      0.32-396.9, real valued
13   STATUS        Percentage of the population which has lower status                    1.7-38.0, real valued


The goal we set for this section is to predict whether the median price 
of a house in each area is larger than or smaller than $20 000. That is, we 
formulate a two-class classification problem. In total, the data set con- 
sists of 506 areas, where for 215 areas the price is lower than $20 000 
and for 291 areas it is higher. This means that when a new object has to 
be classified using only the class prior probabilities, assuming the data 
set is a fair representation of the true classification problem, it can be 
expected that in (215/506) × 100% = 42.5% of the cases we will
misclassify the area.
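
The baseline figure above follows directly from the class counts. As a minimal sketch (the variable names are illustrative and not part of PRTools), a classifier that always predicts the most probable class has this error:

```matlab
n_low  = 215;                        % areas with median price below $20 000
n_high = 291;                        % areas with median price above $20 000
prior  = [n_low n_high] / (n_low + n_high);   % class prior probabilities
[p_max, best_class] = max(prior);    % always predict the most probable class
err_prior = 1 - p_max;               % expected fraction of misclassified areas
fprintf('Baseline error: %.1f%%\n', 100*err_prior);
```

Any classifier worth using on this data set should therefore have an error clearly below this value.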

After the very first inspection of the data, by just looking at what values
the different features might take, it appears that the individual features
differ significantly in range. For instance, the values for the feature TAX
vary between 187 and 711 (a range of more than 500), while the feature
values of NOX are between 0.38 and 0.87 (a range of less than 0.5). This
suggests that some scaling might be necessary. A second important
observation is that some of the features can only have discrete values, like
RIVER and HIGHWAY. How to combine real valued features with
discrete features is already a problem in itself. In this analysis we will
ignore this problem, and we will assume that all features are real valued.
(Obviously, we will lose some performance using this assumption.)


9.1.2 Simple classification methods 


Given the varying nature of the different features, and the fact that further
expert knowledge is not given, it will be difficult to construct a good model
for this data. The scatter diagram of Figure 9.1 shows that an assumption
of Gaussian distributed data is clearly wrong (if only by the presence of the
discrete features), but when just classification performance is considered,
the decision boundary might still be good enough. Perhaps more flexible
methods such as the Parzen density or the k-nearest neighbour method will
perform better after a suitable feature selection and feature scaling.

Let us start with some baseline methods and train a linear and
quadratic Bayes classifier, ldc and qdc:


Listing 9.1 


% Load the housing dataset, and set the baseline performance
load housing.mat;
z                                   % Show what dataset we have
w=ldc;                              % Define an untrained linear classifier
err_ldc_baseline=crossval(z,w,5)    % Perform 5-fold cross-validation
err_qdc_baseline=crossval(z,qdc,5)  % idem for the quadratic classifier

Figure 9.1 Scatter plots of the Boston Housing data set. The left subplot shows
features STATUS and INDUSTRY, where the discrete nature of INDUSTRY can be 
spotted. In the right subplot, the data set is first scaled to unit variance, after which it 
is projected onto its first two principal components 

The five-fold cross-validation errors of ldc and qdc are 13.0% and
17.4%, respectively. Note that by using the function crossval, we 
avoided having to use the training set for testing. If we had done that, 
the errors would be 11.9% and 16.2%, but this estimate would be 
biased and too optimistic. The results also depend on how the data is 
randomly split into batches. When this experiment is repeated, you 
will probably find slightly different numbers. Therefore, it is advisable 
to repeat the entire experiment a number of times (say, 5 or 10) to get 
an idea of the variation in the cross-validation results. However, for 
many experiments (such as the feature selection and neural network 
training below), this may lead to unacceptable training times; there- 
fore, the code given does not contain any repetitions. For all the 
results given, the standard deviation is about 0.005, indicating that 
the difference between ldc and qdc is indeed significant. Notice that
we use the word ‘significant’ here in a slightly loose sense. From a 
statistical perspective it would mean that for the comparison of the 
two methods, we should state a null-hypothesis that both methods 
perform the same, and define a statistical test (for instance a t-test) to 
decide if this hypothesis holds or not. In this discussion we use the 
simple approach in which we just look at the standard deviation of 
the classifier performances, and call the performance difference sig- 
nificant when the averages differ by more than two times their stand- 
ard deviations. 
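
The informal rule described above is easy to state in code. The sketch below applies it to the two errors and the standard deviation quoted in this section; it is a rule of thumb only, not a replacement for a proper statistical test.

```matlab
err_ldc = 0.130;                     % cross-validation error of ldc quoted above
err_qdc = 0.174;                     % cross-validation error of qdc quoted above
sd      = 0.005;                     % approximate standard deviation quoted above
% Loose rule: the difference is called significant when the averages
% differ by more than two times the standard deviation
is_significant = abs(err_ldc - err_qdc) > 2*sd;
```

Here the difference is 0.044, well above the 0.01 required by the rule.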

Even with these simple models on the raw, not preprocessed data, a
relatively good test error of 13.0% can be achieved (compared with the
42.5% error of the simplest approach, which looks only at the class prior
probabilities). Note that qdc, which is a more powerful classifier, gives
the worse performance. This is due to the relatively low sample size: two
covariance matrices, with $\frac{1}{2} \cdot 13(13+1) = 91$ free parameters
each, have to be estimated on 80% of the data available for the classes
(i.e. 172 and 233 samples, respectively). Clearly, this leads to poor estimates.
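
The numbers in this argument can be reproduced with a short calculation. The sketch below only restates the arithmetic of the paragraph; the variable names are illustrative.

```matlab
d = 13;                              % number of features
n_cov = d*(d+1)/2;                   % free parameters of a symmetric d-by-d covariance matrix
n_class = [215 291];                 % objects per class in the full data set
n_train = round(0.8*n_class);        % training objects per class in 5-fold cross-validation
samples_per_param = n_train / n_cov; % training objects per covariance parameter
```

With only about two to three training objects per covariance parameter in each class, noisy covariance estimates are to be expected.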


9.1.3 Feature extraction 


It might be expected that more flexible methods, like the k-nearest
neighbour classifier (knnc, with k optimized using the leave-one-out method)
or the Parzen density estimator (parzenc) give better results.
Surprisingly, a quick check shows that then the errors become 19.6% and 21.9%
(with a standard deviation of about 1.1% and 0.6%, respectively), clearly
indicating that some overtraining occurs and/or that some feature selection
and feature scaling might be required. The next step, therefore, is to rescale
the data to have zero mean and unit variance in all feature directions. This
can be performed using the PRTools function scalem([],'variance'):


Listing 9.2 


load housing.mat;
% Define an untrained linear classifier w/scaled input data
w_sc=scalem([],'variance');
w=w_sc*ldc;
% Perform 5-fold cross-validation
err_ldc_sc=crossval(z,w,5)
% Do the same for some other classifiers
err_qdc_sc=crossval(z,w_sc*qdc,5)
err_knnc_sc=crossval(z,w_sc*knnc,5)
err_parzenc_sc=crossval(z,w_sc*parzenc,5)


First note that when we introduce a preprocessing step, this step should
be defined inside the mapping w. The obvious approach, to map the
whole data set z_sc = z*scalem(z,'variance') and then to apply
the cross-validation to estimate the classifier performance, is incorrect.
In that case, some of the testing data is already used in the scaling of the
data, resulting in an overtrained classifier, and thus in an unfair estimate
of the error. To avoid this, the mapping should be extended from ldc to
w_sc*ldc. The routine crossval then takes care of fitting both the
scaling and the classifier.
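
What crossval does with a combined mapping can be imitated in plain MATLAB. The sketch below is not PRTools code: it uses synthetic two-class data and a nearest-mean classifier as a simple stand-in (both are assumptions made for illustration), and it shows the essential point that the scaling parameters are estimated on the training part of each fold only.

```matlab
% Synthetic stand-in data (ASSUMED): two classes, features on very different scales
n = 100;
X = [randn(n,2); randn(n,2) + 1.5] .* repmat([100 0.1], 2*n, 1);
y = [ones(n,1); 2*ones(n,1)];
nfolds = 5;
fold = mod(randperm(2*n)', nfolds) + 1;      % random fold label 1..nfolds per object
nerr = 0;
for f = 1:nfolds
    tst = (fold == f);  trn = ~tst;
    % Fit the scaling on the training part only ...
    mu = mean(X(trn,:));  sd = std(X(trn,:));
    Xtrn = (X(trn,:) - repmat(mu, sum(trn), 1)) ./ repmat(sd, sum(trn), 1);
    % ... and apply the same, fixed scaling to the held-out part
    Xtst = (X(tst,:) - repmat(mu, sum(tst), 1)) ./ repmat(sd, sum(tst), 1);
    % Nearest-mean classifier as a simple stand-in for the real classifier
    m1 = mean(Xtrn(y(trn)==1,:));  m2 = mean(Xtrn(y(trn)==2,:));
    d1 = sum((Xtst - repmat(m1, sum(tst), 1)).^2, 2);
    d2 = sum((Xtst - repmat(m2, sum(tst), 1)).^2, 2);
    yhat = 1 + (d2 < d1);                    % label 2 where class 2 mean is closer
    nerr = nerr + sum(yhat ~= y(tst));
end
err_cv = nerr / (2*n);                       % cross-validation error estimate
```

Computing `mu` and `sd` once on all of `X` before the loop would let the held-out objects influence the preprocessing, which is exactly the leak the text warns against.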

By scaling the features, the performance of the first two classifiers, 
ldc and qdc, should not change. The normal density based classifiers 
are insensitive to the scaling of the individual features, because they 
already use their variance estimation. The performances of knnc and
parzenc, on the other hand, improve significantly, to 14.1% and
13.1%, respectively (with a standard deviation of about 0.4%).
Although parzenc approaches the performance of the linear classifier, 
it is still slightly worse. Perhaps feature extraction or feature selection 
will improve the results. 

As discussed in Chapter 7, principal component analysis (PCA) is one 
of the most often used feature extraction methods. It focuses on the high- 
variance directions in the data, and removes low-variance directions. 
In this data set we have seen that the feature values have very different 
scales. Applying PCA directly to this data will put high emphasis on the 
feature TAX and will probably ignore the feature NOX. Indeed, when 
PCA preprocessing is applied such that 90% of the variance is retained,
the performance of all the methods significantly decreases. To avoid this 
clearly undesired effect, we will first rescale the data to have unit 
variance and apply PCA on the resulting data. The basic training proced- 
ure now becomes: 


Listing 9.3 


load housing.mat;
% Define a preprocessing
w_pca=scalem([],'variance')*pca([],0.9);
% Define the classifier
w=w_pca*ldc;
% Perform 5-fold cross-validation
err_ldc_pca=crossval(z,w,5)


It appears that, compared with normal scaling, the application of
pca([],0.9) does not significantly improve the performances. For some
methods, the performance increases slightly (16.6% (±0.6%) error for
qdc, 13.6% (±0.9%) for knnc), but for other methods, it decreases.
This indicates that the high-variance directions are not much more
informative than the low-variance directions.
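
The effect of scaling on PCA can be seen in a small experiment without PRTools. The sketch below uses synthetic data with three independent features of very different ranges (an assumption made for illustration) and computes the principal components from the eigenvectors of the covariance matrix.

```matlab
% Synthetic stand-in data (ASSUMED): 3 independent features with very different ranges
N = 500;
X = randn(N,3) .* repmat([500 1 0.5], N, 1);

% PCA on the raw data: the first component is dominated by feature 1
[V_raw, D_raw] = eig(cov(X));
[lam_raw, idx] = sort(diag(D_raw), 'descend');
V_raw = V_raw(:, idx);

% Scale to zero mean and unit variance first, then apply PCA
Xs = (X - repmat(mean(X), N, 1)) ./ repmat(std(X), N, 1);
[V, D] = eig(cov(Xs));
[lam, idx] = sort(diag(D), 'descend');
V = V(:, idx);

% Keep the smallest number of components that retains 90% of the variance
frac  = cumsum(lam) / sum(lam);
ncomp = find(frac >= 0.9, 1);
Y = Xs * V(:, 1:ncomp);                      % data projected on the retained components
```

On the raw data nearly all variance lies along the first feature, so a 90% criterion would keep that feature alone. After scaling, each feature contributes equally and the criterion can no longer discard a direction merely because of its units.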








9.1.4 Feature selection 


The use of a simple supervised feature extraction method, such as the 
Bhattacharyya mapping (implemented by replacing the call to pca by
bhatm([])), also decreases the performance. We will therefore have to 
use better feature selection methods to reduce the influence of noisy 
features and to gain some performance. 

We will first try branch-and-bound feature selection to find the optimal
subset of five features, with the simple inter-intra class distance measure
as a criterion. Admittedly, the number of features selected, five, is
arbitrary, but the branch-and-bound method does not allow for finding
the optimal subset size.


Listing 9.4 


% Load the housing dataset
load housing.mat;
% Construct scaling and feature selection mapping
w_fsf=featselo([],'in-in',5)*scalem([],'variance');
% Calculate cross-validation error for classifiers
% trained on the optimal 5-feature set
err_ldc_fsf=crossval(z,w_fsf*ldc,5)
err_qdc_fsf=crossval(z,w_fsf*qdc,5)
err_knnc_fsf=crossval(z,w_fsf*knnc,5)
err_parzenc_fsf=crossval(z,w_fsf*parzenc,5)


The feature selection routine often selects features STATUS, AGE and 
WORK, plus some others. The results are not very good: the perform- 
ance decreases for all classifiers. Perhaps we can do better if we take the 
performance of the actual classifier to be used as a criterion, rather than 
the general inter-intra class distance. To this end, we can just pass the
classifier to the feature selection algorithm. Furthermore, we can also let 
the algorithm find the optimal number of features by itself. This means 
that branch-and-bound feature selection can now no longer be used, as 
the criterion is not monotonically increasing. Therefore, we will use 
forward feature selection, featself. 
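
The idea behind forward selection with a classifier as criterion can be sketched in plain MATLAB. This is not the featself implementation: the data are synthetic, the criterion is the hold-out accuracy of a nearest-mean classifier, and the stopping rule (stop as soon as no candidate improves the criterion) is one simple choice among several. All of these are assumptions made for illustration.

```matlab
% Synthetic stand-in data (ASSUMED): only features 1 and 3 carry class information
n = 200;
X = randn(2*n, 5);
y = [ones(n,1); 2*ones(n,1)];
X(y==2, 1) = X(y==2, 1) + 3;
X(y==2, 3) = X(y==2, 3) + 1;

trn = mod((1:2*n)', 2) == 1;  tst = ~trn;    % odd rows train, even rows test
nt = sum(tst);
selected = [];  best_so_far = 0;  remaining = 1:size(X,2);
while ~isempty(remaining)
    score = zeros(size(remaining));
    for j = 1:length(remaining)
        S = [selected remaining(j)];         % candidate feature subset
        m1 = mean(X(trn & y==1, S), 1);      % class means on the training part
        m2 = mean(X(trn & y==2, S), 1);
        d1 = sum((X(tst,S) - repmat(m1, nt, 1)).^2, 2);
        d2 = sum((X(tst,S) - repmat(m2, nt, 1)).^2, 2);
        score(j) = mean((1 + (d2 < d1)) == y(tst));   % hold-out accuracy
    end
    [best, jbest] = max(score);
    if best <= best_so_far, break; end       % stop when no candidate improves
    selected = [selected remaining(jbest)];  % add the best candidate
    best_so_far = best;
    remaining(jbest) = [];
end
```

Because the criterion is the accuracy of an actual classifier, it need not increase monotonically with the number of features, which is why a greedy search of this kind is used instead of branch-and-bound.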


Listing 9.5 


% Load the housing dataset
load housing.mat;
% Optimize feature set for ldc
w_fsf=featself([],ldc,0)*scalem([],'variance');
err_ldc_fsf=crossval(z,w_fsf*ldc,5)
% Optimize feature set for qdc
w_fsf=featself([],qdc,0)*scalem([],'variance');
err_qdc_fsf=crossval(z,w_fsf*qdc,5)
% Optimize feature set for knnc
w_fsf=featself([],knnc,0)*scalem([],'variance');
err_knnc_fsf=crossval(z,w_fsf*knnc,5)
% Optimize feature set for parzenc
w_fsf=featself([],parzenc,0)*scalem([],'variance');
err_parzenc_fsf=crossval(z,w_fsf*parzenc,5)














This type of feature selection turns out to be useful only for ldc and
qdc, whose performances improve to 12.8% (±0.6%) and 14.9%
(±0.5%), respectively. knnc and parzenc, on the other hand, give
15.9% (±1.0%) and 13.9% (±1.7%), respectively. These results do
not differ significantly from the previous ones. The featself routine
often selects the same rather large set of features (from most to least
significant): STATUS, AGE, WORK, INDUSTRY, AA, CRIME,
LARGE, HIGHWAY, TAX. But these features are highly correlated,
and the set used can be reduced to the first three with just a small
increase in error of about 0.005. Nevertheless, feature selection in
general does not seem to help much for this data set.


9.1.5 Complex classifiers 


The fact that ldc already performs so well on the original data indicates
that the data is almost linearly separable. A visual inspection of the
scatter plots in Figure 9.1 seems to strengthen this hypothesis. It becomes
even more apparent after the training of a linear support vector classifier
(svc([],'p',1)) and a Fisher classifier (fisherc), both with a
cross-validation error of 13.0%, the same as for ldc.

Given that ldc and parzenc thus far performed best, we might try
to train a number of classifiers based on these concepts, for which we are
able to tune classifier complexity. Two obvious choices for this are
neural networks and support vector classifiers (SVCs). Starting with
the latter, we can train SVCs with polynomial kernels of degree close
to 1, and with radial basis kernels of radius close to 1. By varying the
degree or radius, we can vary the resulting classifier's nonlinearity:


Listing 9.6 


load housing.mat;                    % Load the housing dataset
w_pre=scalem([],'variance');         % Scaling mapping
degree=1:3;                          % Set range of parameters
radius=1:0.25:3;
for i=1:length(degree)
  err_svc_p(i)=...                   % Train polynomial SVC
    crossval(z,w_pre*svc([],'p',degree(i)),5);
end;
for i=1:length(radius)
  err_svc_r(i)=...                   % Train radial basis SVC
    crossval(z,w_pre*svc([],'r',radius(i)),5);
end;
figure; clf; plot(degree,err_svc_p);
figure; clf; plot(radius,err_svc_r);


The results of a single repetition are shown in Figure 9.2: the optimal
polynomial kernel SVC is a quadratic one (a degree of 2), with an average
error of 12.5%, and the optimal radial basis kernel SVC (a radius of
around 2) is slightly better, with an average error of 11.9%. Again, note
that we should really repeat the experiment a number of times, to get an
impression of the variance in the results. The standard deviation is
between 0.2% and 1.0%. This means that the minimum in the right
subfigure is indeed significant, and that the radius parameter should
indeed be around 2.0. On the other hand, in the left subplot the graph
is basically flat and a linear SVC is therefore probably to be preferred.

Figure 9.2 Performance of a polynomial kernel SVC (left, as a function of the
degree of the polynomial) and a radial basis function kernel SVC (right, as a function
of the basis function width)

For the sake of completeness, we can also train feed-forward neural
networks with varying numbers of hidden layers and units. In PRTools,
there are three routines for training feed-forward neural networks. The
bpxnc function trains a network using the back-propagation algorithm,
which is slow but does not often overtrain. The lmnc function uses a
second-order optimization routine, Levenberg-Marquardt, which
speeds up training significantly but often results in overtrained
networks. Finally, the neurc routine attempts to counteract the
overtraining problem of lmnc by creating an artificial tuning set of 1000 samples
by perturbing the training samples (see gendatk) and stops training
when the error on this tuning set increases. It applies lmnc three times
and returns the neural network giving the best result on the tuning set.
Here, we apply both bpxnc and neurc:


Listing 9.7 


load housing.mat;                    % Load the housing dataset
w_pre=scalem([],'variance');         % Scaling mapping
networks={bpxnc,neurc};              % Set range of parameters
nlayers=1:2;
nunits=[4 8 12 16 20 30 40];
for i=1:length(networks)
  for j=1:length(nlayers)
    for k=1:length(nunits)
      % Train a neural network with nlayers(j) hidden layers
      % of nunits(k) units each, using algorithm networks{i}
      err_nn(i,j,k)=crossval(z,...
        w_pre*networks{i}([],ones(1,nlayers(j))*nunits(k)),5);
    end;
  end;
  figure; clf;                       % Plot the errors
  plot(nunits,squeeze(err_nn(i,1,:)),'-'); hold on;
  plot(nunits,squeeze(err_nn(i,2,:)),'--');
  legend('1 hidden layer','2 hidden layers');
end;


Training neural networks is a computationally intensive process; and 
here they are trained for a large range of parameters, using cross-validation. 
The algorithm above takes more than a day to finish on a modern 
workstation, although per setting just a single neural network is trained. 

The results, shown in Figure 9.3, seem to be quite noisy. After
repeating the algorithm several times, it appears that the standard deviation is
in the order of 1%. Ideally, we would expect the error as a function of
the number of hidden layers and units per hidden layer to have a clear
global optimum. For bpxnc, this is roughly the case, with a minimal
cross-validation error of 10.5% for a network with one hidden layer of
30 units, and 10.7% for a network with two hidden layers of 16 units.
Normally, we would prefer to choose the network with the lowest
complexity as the variance in the error estimate would be lowest.
However, the two optimal networks here have roughly the same number of
parameters. So, there is no clear best choice between the two.

Figure 9.3 Performance of neural networks with one or two hidden layers as a
function of the number of units per hidden layer, trained using bpxnc (left) and neurc
(right)

The cross-validation errors of networks trained with neurc show 
more variation (Figure 9.3). The minimal cross-validation error is again 
10.5% for a network with a single hidden layer of 30 units. Given that 
the graph for bpxnc is much smoother than that of neurc, we would 
prefer to use a bpxnc-trained network. 


9.1.6 Conclusions 


The best overall result on the housing data set was obtained using a 
bpxnc-trained neural network (10.5% cross-validation error), slightly 
better than the best SVC (11.9%) or a simple linear classifier (13.0%). 
However, remember that neural network training is a noisier process
than training an SVC or linear classifier: the latter two will find the
same solution when run twice on the same data set, whereas a neural 
network may give different results due to the random initialization. 
Therefore, using an SVC in the end may be preferable. 

Of course, the analysis above is not exhaustive. We could still have tried 
more exotic classifiers, performed feature selection using different criteria 
and search methods, searched through a wider range of parameters for the 
SVCs and neural networks, investigated the influence of possible outlier 
objects and so on. However, this will take a lot of computation, and for 
this application there seems to be no reason to believe we might obtain 
a significantly better classifier than those found above. 


9.2 TIME-OF-FLIGHT ESTIMATION OF AN 
ACOUSTIC TONE BURST 


The determination of the time of flight (ToF) of a tone burst is a key 
issue in acoustic position and distance measurement systems. The length 
of the acoustic path between a transmitter and receiver is proportional to 
the ToF, that is, the time elapsed between the departure of the waveform
from the transmitter and the arrival at the receiver (Figure 9.4). 
The position of an object is obtained, for instance, by measuring the 
distances from the object to a number of acoustic beacons. See also 
Figure 1.2. 

Figure 9.4 Set-up of a sensory system for acoustic distance measurements


The quality of an acoustic distance measurement is directly related to 
the quality of the ToF determination. Electronic noise, acoustic noise, 
atmospheric turbulence and temperature variations are all factors that 
influence the quality of the ToF measurement. In indoor situations, 
objects in the environment (wall, floor, furniture, etc.) may cause echoes 
that disturb the nominal response. These unwanted echoes can cause 
hard-to-predict waveforms, thus making the measurement of the ToF 
a difficult task. 

The transmitted waveform can take various forms, for instance, 
a frequency modulated (chirped) continuous waveform (CWFM), a 
frequency or phase shift-keyed signal or a tone burst. The latter is a 
pulse consisting of a number of periods of a sine wave. An advantage of 
a tone burst is that the bandwidth can be kept moderate by adapting the 
length of the burst. Therefore, this type of signal is suitable for use in 
combination with piezoelectric transducers, which are cheap and robust, 
but have a narrow bandwidth. 

In this section, we design an estimator for the determination of the 
ToFs of tone bursts that are acquired in indoor situations using a set-up 
as shown in Figure 9.4. The purpose is to determine the time delay 
between sending and receiving a tone burst. A learning and evaluation 
data set is available that contains 150 records of waveforms acquired in 
different rooms, different locations in the rooms, different distances and 
different heights above the floor. Figure 9.4 shows an example of one of 
the waveforms. Each record is accompanied by a reference ToF
indicating the true value of the ToF. The standard deviation of the reference
ToF is estimated at 10 (µs). The applied sampling period is $\Delta$ = 2 (µs).

The literature roughly mentions three concepts to determine the ToF,
i.e. thresholding, data fitting (regression) and ML (maximum likelihood)
estimation (van der Heijden et al., 2003). Many variants of these
concepts have been proposed. This section only considers the main
representatives of each concept:


• Comparing the envelope of the wave against a threshold that is
  proportional to the magnitude of the waveform.
• Fitting a one-sided parabola to the foot of the envelope of the
  waveform.
• Conventional matched filtering.
• Extended matched filtering based on a covariance model of the signal.


The first one is a heuristic method that does not optimize any criterion 
function. The second one is a regression method, and as such a repre- 
sentative of data fitting. The last two methods are ML estimators. The 
difference between these is that the latter uses an explicit model of 
multiple echoes. In the former case, such a model is missing. 

The section first describes the methods. Next, the optimization of the 
design parameters using the data set is explained, and the evaluation is 
reported. The data set and the MATLAB listings of the various methods 
can be found on the accompanying website. 


9.2.1 Models of the observed waveform 


The moment of time at which a transmission begins is well defined since 
it is triggered under full control of the sensory system. The measurement 
of the moment of arrival is much more involved. Due to the narrow 
bandwidth of the transducers the received waveform starts slowly. A low 
SNR makes the moment of arrival indeterminate. Therefore, the design 
of a ToF estimator requires the availability of a model describing the 
arriving waveform. This waveform w(t) consists of three parts: the
nominal response $a \cdot h(t - \tau)$ to the transmitted wave; the interfering
echoes $a \cdot r(t - \tau)$; and the noise v(t):

$$w(t) = a \cdot h(t - \tau) + a \cdot r(t - \tau) + v(t) \qquad (9.1)$$

We assume that the waveform is transmitted at time t = 0, so that $\tau$ is
the ToF. Equation (9.1) simply states that the observed waveform equals,
apart from echoes and noise, the nominal response h(t), but now
time-shifted by $\tau$ and attenuated by a.
Such an assumption is correct for a medium like air because, within the
bandwidth of interest, the propagation of a waveform through air does
not show a significant dispersion. The attenuation coefficient a depends
on many factors, but also on the distance, and thus also on $\tau$. However,
for the moment we will ignore this fact. The possible echoes are
represented by $a \cdot r(t - \tau)$. They share the same time shift $\tau$ because no
echo can occur before the arrival of the nominal response. The
additional time delays of the echoes are implicitly modelled within r(t). The
echoes and the nominal response also share a common attenuation
factor. The noise v(t) is considered white.

The actual shape of the nominal response h(t) depends on the choice 
of the tone burst and on the dynamic properties of the transducers. 
Sometimes, a parametric empirical model is used, for instance: 


$$h(t) = t^m \exp(-t/T) \cos(2\pi f t + \varphi) \qquad t > 0 \qquad (9.2)$$


f is the frequency of the tone burst; $\cos(2\pi f t + \varphi)$ is the carrier; and
$t^m \exp(-t/T)$ is the envelope. The factor $t^m$ describes the rise of the
waveform (m is empirically determined; usually between 1 and 3). The
factor exp(−t/T) describes the decay. Another possibility is to model h(t)
non-parametrically. In that case, a sampled version of h(t), obtained in 
an anechoic room where echoes and noise are negligible, is recorded. 
The data set contains such a record. See Figure 9.5. 
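
The parametric model (9.2) is easy to generate numerically. In the sketch below, the carrier frequency, decay constant, rise exponent and phase are illustrative assumptions and not values taken from the data set; only the sampling period of 2 µs is the one stated earlier in this section.

```matlab
% Parametric model (9.2) of the nominal response; parameter values are ASSUMED
f   = 40e3;                          % tone burst frequency (Hz)
T   = 0.1e-3;                        % decay time constant (s)
m   = 2;                             % rise exponent, usually between 1 and 3
phi = 0;                             % carrier phase (rad)
Delta = 2e-6;                        % sampling period (s)
t = (0:Delta:1.2e-3)';               % time axis (s)
envelope = t.^m .* exp(-t/T);        % rise t^m and decay exp(-t/T)
h = envelope .* cos(2*pi*f*t + phi); % carrier times envelope
h = h / max(abs(h));                 % normalise to unit peak (optional)
[emax, imax] = max(envelope);
t_peak = t(imax);                    % the envelope peaks near t = m*T
```

Setting the derivative of $t^m \exp(-t/T)$ to zero gives a peak at t = mT, so m and T together control how slowly the waveform rises, which matters for the thresholding methods discussed later.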

Often, the existence of echoes is simply ignored, r(t) = 0. Sometimes,
a single echo is modelled $r(t) = d_1 h(t - \tau_1)$ where $\tau_1$ is the delay of the
echo with respect to $t = \tau$. The most extensive model is when multiple
echoes are considered $r(t) = \sum_k d_k h(t - \tau_k)$. The sequences $d_k$ and $\tau_k$ are
hardly predictable and therefore regarded as random. In that case, r(t)
becomes random too, and the echoes are seen as disturbing noise with
non-stationary properties.

Figure 9.5 A record of the nominal response h(t)

The observed waveform $z = [z_0 \; \ldots \; z_{N-1}]^T$ is a sampled version
of w(t):

$$z_n = w(n\Delta) \qquad (9.3)$$

$\Delta$ is the sampling period. N is the number of samples. Hence, $N\Delta$ is the
registration period. With that, the noise v(t) manifests itself as a random
vector v, with zero mean and covariance matrix $C_v = \sigma_v^2 I$.
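
To get a feeling for the model, the sampled waveform of (9.1) and (9.3) can be simulated. The sketch below assumes the parametric nominal response of (9.2) and uses made-up values for the ToF, the attenuation, the two echoes and the noise level; none of these numbers come from the data set.

```matlab
% Nominal response h(t) as in (9.2), zero for t <= 0; parameter values ASSUMED
f = 40e3;  T = 0.1e-3;  m = 2;
h = @(t) (t > 0) .* max(t,0).^m .* exp(-max(t,0)/T) .* cos(2*pi*f*t);

Delta = 2e-6;  N = 2000;             % sampling period (s) and number of samples
tn = (0:N-1)' * Delta;               % sample times n*Delta
tau = 1e-3;                          % ASSUMED true time of flight (s)
a = 1e8;                             % ASSUMED attenuation (brings the peak to order 1)
d_k   = [0.5 0.3];                   % ASSUMED echo amplitudes
tau_k = [0.2e-3 0.45e-3];            % ASSUMED echo delays relative to tau (s)
r = @(t) d_k(1)*h(t - tau_k(1)) + d_k(2)*h(t - tau_k(2));

sigma_v = 0.02;                      % ASSUMED noise standard deviation
w_clean = a*h(tn - tau) + a*r(tn - tau);     % nominal response plus echoes
z = w_clean + sigma_v*randn(N,1);    % z_n = w(n*Delta) with white noise
```

Plotting `z` against `tn` shows how the echoes overlap with the tail of the nominal response, while the part of the record before `tau` contains noise only.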


9.2.2 Heuristic methods for determining the ToF 


Some applications require cheap solutions that are suitable for direct
implementation using dedicated hardware, such as instrumental electronics.
For that reason, a popular method to determine the ToF is
simply thresholding the observed waveform at a level T. The estimated
ToF $\hat{\tau}_{thres}$ is the moment at which the waveform crosses the threshold
level T.

Due to the slow rise of the nominal response, the moment $\hat{\tau}_{thres}$ of
level crossing appears just after the true $\tau$, thus causing a bias. Such a
bias can be compensated afterwards. The threshold level T should be
chosen above the noise level. The threshold operation is simple to
realize, but a disadvantage is that the bias depends on the magnitude
of the waveform. Therefore, an improvement is to define the threshold
level relative to the maximum of the waveform, that is $T = \alpha \max(w(t))$,
where $\alpha$ is a constant set to, for instance, 30%.
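The relative threshold operation can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the experiments; the function name and the `bias` argument (a compensation value that would have to be determined separately) are assumptions.

```python
def tof_by_threshold(w, delta, alpha=0.3, bias=0.0):
    """Return the time of the first sample that reaches alpha * max(w),
    reduced by a bias compensation. Returns None for an empty record."""
    if not w:
        return None
    level = alpha * max(w)          # threshold relative to the waveform maximum
    for n, value in enumerate(w):
        if value >= level:
            return n * delta - bias
    return None

# Toy waveform sampled with a period of 2 time units.
w = [0.0, 0.0, 0.1, 0.2, 0.5, 1.0, 0.4]
tof = tof_by_threshold(w, delta=2.0, alpha=0.3)
```

In the toy record the level is 0.3, which is first reached at sample 4, so the uncompensated estimate is 8 time units.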

The observed waveform can be written as $g(t)\cos(2\pi f t + \varphi) + n(t)$,
where g(t) is the envelope. The carrier $\cos(2\pi f t + \varphi)$ of the waveform
causes a resolution error equal to 1/f. Therefore, rather than applying
the threshold operation directly, it is better to apply it to the envelope.
A simple, but inaccurate method to get the envelope is to rectify the
waveform and to apply a low-pass filter to the result. The optimal method,
however, is much more involved and uses quadrature filtering. A simpler
approximation is as follows. First, the waveform is band-filtered to
reduce the noise. Next, the filtered signal is phase-shifted over 90° to
obtain the quadrature component $q(t) = g(t)\sin(2\pi f t + \varphi) + n_q(t)$.
Finally, the envelope is estimated using $\hat{g}(t) = \sqrt{w_{filtered}^2(t) + q^2(t)}$.
Figure 9.6 provides an example.

Figure 9.6 ToF measurements based on thresholding operations

The design parameters are the relative threshold $\alpha$, the bias compensation
b, and the cut-off frequencies of the band-filter.


9.2.3 Curve fitting 


In the curve-fitting approach, a functional form is used to model the 
envelope. The model is fully known except for some parameters, one of 
which is the ToF. As such, the method is based on the regression 
techniques introduced in Section 3.3.3. On adoption of an error criterion 
between the observed waveform and the model, the problem boils down 
to finding the parameters that minimize the criterion. We will use the 
SSD criterion discussed in Section 3.3.1. The particular function that will 
be fitted is the one-sided parabola defined by: 


$$ f(t, x) = \begin{cases} x_0 + x_1 (t - x_2)^2 & \text{if } t > x_2 \\ x_0 & \text{elsewhere} \end{cases} \qquad (9.4) $$

The final estimate of the ToF is $\hat{\tau}_{curve} = x_2$.

The function must be fitted to the foot of the envelope. Therefore, an
important task is to determine the interval $t_b < t < t_e$ that makes up the
foot. The choice of $t_b$ and $t_e$ is critical. If the interval is too short, then the
noise sensitivity is large. If the interval is too large, modelling errors
become too influential. The strategy is to find two anchor points $t_1$ and
$t_2$ that are stable enough under the various conditions.

The first anchor point $t_1$ is obtained by thresholding a low-pass
filtered version $g_{filtered}(t)$ of the envelope just above the noise level.
If $\sigma_v$ is the standard deviation of the noise in w(t), then $g_{filtered}(t)$ is
thresholded at a level $3\tfrac{1}{2}\sigma_v$, thus yielding $t_1$. The standard deviation $\sigma_v$
is estimated during a period preceding the arrival of the tone. A second
suitable anchor point $t_2$ is the first location just after $t_1$ where $g_{filtered}(t)$ is
maximal, i.e. the location just after $t_1$ where $dg_{filtered}(t)/dt = 0$. $t_e$ is
defined midway between $t_1$ and $t_2$ by thresholding $g_{filtered}(t)$ at a level
$3\tfrac{1}{2}\sigma_v + \alpha\,(g_{filtered}(t_2) - 3\tfrac{1}{2}\sigma_v)$. Finally, $t_b$ is calculated as
$t_b = t_1 - \beta\,(t_e - t_1)$. Figure 9.7 provides an example.

Once the interval of the foot has been established, the curve can be 
fitted to the data (which in this case is the original envelope g(t)). Since 
the curve is nonlinear in x, an analytical solution, such as in the poly- 
nomial regression in Section 3.3.3, cannot be applied. Instead a numer- 
ical procedure, for instance using MATLAB’s fminsearch, should be 
applied. 
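As an illustration of fitting the one-sided parabola $f(t,x) = x_0 + x_1(t - x_2)^2$ for $t > x_2$ (and $x_0$ elsewhere), the Python sketch below uses a different numerical procedure than a simplex search such as fminsearch. It exploits the fact that, for a fixed $x_2$, the model is linear in $x_0$ and $x_1$, so these two follow in closed form and only $x_2$ needs to be scanned over a list of candidates. The function name and the candidate grid are assumptions made for this example.

```python
def fit_one_sided_parabola(t, g, x2_candidates):
    """Least squares fit of x0 + x1*(t - x2)^2 (t > x2), x0 elsewhere.
    Returns (ssd, x0, x1, x2) for the best candidate x2, or None."""
    best = None
    n = len(t)
    for x2 in x2_candidates:
        # Regressor: squared distance beyond the knee, zero before it.
        p = [(ti - x2) ** 2 if ti > x2 else 0.0 for ti in t]
        sp = sum(p)
        spp = sum(pi * pi for pi in p)
        sg = sum(g)
        spg = sum(pi * gi for pi, gi in zip(p, g))
        det = n * spp - sp * sp
        if det <= 1e-12 * n * spp:      # degenerate: no usable sample beyond x2
            continue
        x0 = (spp * sg - sp * spg) / det    # normal equations, 2 unknowns
        x1 = (n * spg - sp * sg) / det
        ssd = sum((gi - x0 - x1 * pi) ** 2 for gi, pi in zip(g, p))
        if best is None or ssd < best[0]:
            best = (ssd, x0, x1, x2)
    return best

# Noise-free toy data with x0 = 0.5, x1 = 2, x2 = 3.
t = [float(k) for k in range(10)]
g = [0.5 + 2.0 * (ti - 3.0) ** 2 if ti > 3.0 else 0.5 for ti in t]
ssd, x0, x1, x2 = fit_one_sided_parabola(t, g, [0.5 * k for k in range(16)])
```

The resolution of the ToF estimate is limited by the spacing of the candidate grid; a finer grid, or a local refinement around the best candidate, would be needed in practice.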

The design parameters are $\alpha$, $\beta$ and the cut-off frequency of the
low-pass filter.


Figure 9.7 ToF estimation by fitting a one-sided parabola to the foot of the
envelope

9.2.4 Matched filtering


This conventional solution is achieved by neglecting the reflections. The
measurements are modelled by a vector z with N elements:

$$ z_n = a\,h(n\Delta - \tau) + v(n\Delta) \qquad (9.5) $$

The noise is represented by a random vector v with zero mean and
covariance matrix $C_v = \sigma_v^2 I$. Upon introduction of a vector $h(\tau)$ with
elements $h(n\Delta - \tau)$, the conditional probability density of z is:

$$ p(z|\tau) = \frac{1}{\sqrt{(2\pi\sigma_v^2)^N}} \exp\left(-\frac{1}{2\sigma_v^2}\,(z - a\,h(\tau))^T (z - a\,h(\tau))\right) \qquad (9.6) $$

Maximization of this expression yields the maximum likelihood estimate
for $\tau$. In order to do so, we only need to minimize the $L_2$ norm of $z - a\,h(\tau)$:

$$ (z - a\,h(\tau))^T (z - a\,h(\tau)) = z^T z + a^2 h(\tau)^T h(\tau) - 2a\,z^T h(\tau) \qquad (9.7) $$

The term $z^T z$ does not depend on $\tau$ and can be ignored. The second term
is the signal energy of the direct response. A change of $\tau$ only causes a
shift of it. But, if the registration period is long enough, the signal energy
is not affected by such a shift. Thus, the second term can be ignored as
well. The maximum likelihood estimate boils down to finding the $\tau$ that
maximizes $a\,z^T h(\tau)$. A further simplification occurs if the extent of h(t) is
limited to, say, $K\Delta$ with $K \ll N$. In that case $a\,z^T h(\tau)$ is obtained by
cross-correlating $z_n$ with $a\,h(n\Delta + \tau)$:

$$ y(\tau) = a \sum_k h(k\Delta - \tau)\, z_k \qquad (9.8) $$

The value of $\tau$ which maximizes $y(\tau)$ is the best estimate. The operator
expressed by (9.8) is called a matched filter or a correlator. Figure 9.8
shows a result of the matched filter. Note that apart from its sign, the
amplitude a does not affect the outcome of the estimate. Hence, the fact
that a is usually unknown doesn't matter much. Actually, if the nominal
response is given in a non-parametric way, the matched filter doesn't
have any design parameters.
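The correlator can be sketched directly in the time domain. The Python fragment below is a minimal illustration that evaluates the correlation only for delays that are integer multiples of the sampling period and assumes a positive amplitude a; the function name is an assumption.

```python
def matched_filter_tof(z, h, delta):
    """Cross-correlate the record z with the template h (K samples).
    Returns (estimated ToF, correlation sequence y), where
    y[i] = sum_k h[k] * z[i + k] corresponds to the delay i * delta."""
    n_shifts = len(z) - len(h) + 1
    y = [sum(hk * z[i + k] for k, hk in enumerate(h)) for i in range(n_shifts)]
    i_hat = max(range(n_shifts), key=lambda i: y[i])
    return i_hat * delta, y

# Toy record: the template, attenuated by 0.5, arrives at sample 7.
h = [1.0, -2.0, 3.0, -1.0]
z = [0.0] * 20
for k, hk in enumerate(h):
    z[7 + k] = 0.5 * hk
tof, y = matched_filter_tof(z, h, delta=0.5)
```

For long records a Fourier-based correlation is far more efficient than this direct double loop; the sketch only serves to show the structure of the operation.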
Figure 9.8 Matched filtering


9.2.5 ML estimation using covariance models for the 
reflections 


The matched filter is not designed to cope with interfering reflections.
In particular, if an echo partly overlaps the nominal response, the results
are inaccurate. In order to encompass situations with complex interfer- 
ence patterns the matched filter must be extended. A possibility is to 
model the echoes explicitly. A tractable model arises if the echoes are 
described by a non-stationary autocovariance function. 


Covariance models 


The echoes are given by $r(t) = \sum_k d_k h(t - \tau_k)$. The points in time, $\tau_k$, are
a random sequence. Furthermore we have $\tau_k > 0$ since all echoes appear
after the arrival of the direct response. The attenuation factors $d_k$ have a
range of values. We will model them as independent Gaussian random
variables with zero mean and variance $\sigma_d^2$. Negative values of $d_k$ are
allowed because of the possible phase reversal of an echo. We limit the
occurrence of an echo to an interval $0 < \tau_k < T$, and assume a uniform
distribution. Then the autocovariance function of a single echo is:

$$ C_k(t_1, t_2) = E\left[d_k^2\, h(t_1 - \tau_k)\, h(t_2 - \tau_k)\right] = \frac{\sigma_d^2}{T} \int_{\tau_k=0}^{T} h(t_1 - \tau_k)\, h(t_2 - \tau_k)\, d\tau_k \qquad (9.9) $$

If there are K echoes, the autocovariance function of r(t) is

$$ C_r(t_1, t_2) = K\,C_k(t_1, t_2) = \frac{K\sigma_d^2}{T} \int_{\tau_k=0}^{T} h(t_1 - \tau_k)\, h(t_2 - \tau_k)\, d\tau_k \qquad (9.10) $$

because the factors $d_k$ and random points $\tau_k$ are independent.

For arbitrary $\tau$, the reflections are shifted accordingly. The sampled version
of the reflections is $r(n\Delta - \tau)$, which can be brought in a vector $r(\tau)$.
The elements $C_{r|\tau}(n, m)$ of the covariance matrix $C_{r|\tau}$ of $r(\tau)$, conditioned
on $\tau$, become:

$$ C_{r|\tau}(n, m) = C_r(n\Delta - \tau,\; m\Delta - \tau) \qquad (9.11) $$

If the registration period is sufficiently large, the determinant $|C_{r|\tau}|$ does
not depend on $\tau$.

The observed waveform $w(t) = a\,(h(t - \tau) + r(t - \tau)) + v(t)$ involves
two unknown factors, the amplitude a and the ToF $\tau$. The prior probability
density of the latter is not important because the maximum likelihood
estimator that we will apply does not require it. However, the
first factor a is a nuisance parameter. We deal with it by regarding a as a
random variable with its own density p(a). The influence of a is integrated
in the likelihood function by means of Bayes' theorem for conditional
probabilities, i.e. $p(z|\tau) = \int p(z|\tau, a)\, p(a)\, da$.

Preferably, the density p(a) reflects our state of knowledge
about a. Unfortunately, taking this path is not easy, for two
reasons. It would be difficult to assess this state of knowledge quantitatively.
Moreover, the result will not be very tractable. A more practical
choice is to assume a zero mean Gaussian density for a. With that,
conditioned on $\tau$, the vector $a\,h(\tau)$ with elements $a\,h(n\Delta - \tau)$ becomes
zero mean and Gaussian with covariance matrix:

$$ C_{h|\tau} = \sigma_a^2\, h(\tau)\, h^T(\tau) \qquad (9.12) $$

where $\sigma_a^2$ is the variance of the amplitude a.

At first sight it seems counterintuitive to model a as a zero mean
random variable since small and negative values of a are not very likely.
The only reason for doing so is that it paves the way to a mathematically
tractable model. In Section 9.2.4 we noticed already that the actual value
of a does not influence the solution. We simply hope that in the extended
matched filter a does not have any influence either. The advantage is that
the dependence of z on $\tau$ is now captured in a concise model, i.e. a single
covariance matrix:

$$ C_{z|\tau} = \sigma_a^2 \left(h(\tau)\, h^T(\tau) + C_{r|\tau}\right) + \sigma_v^2 I \qquad (9.13) $$

This matrix completes the covariance model of the measurements. In the
sequel, we assume a Gaussian conditional density for z. Strictly speaking,
this holds true only if sufficient echoes are present since in that case
the central limit theorem applies.


Maximum likelihood estimation of the time-of-flight 


With the measurements modelled as a zero mean, Gaussian random
vector with the covariance matrix given in (9.13), the likelihood function
for $\tau$ becomes:

$$ p(z|\tau) = \frac{1}{\sqrt{(2\pi)^N\, |C_{z|\tau}|}} \exp\left(-\tfrac{1}{2}\, z^T C_{z|\tau}^{-1} z\right) \qquad (9.14) $$

The maximization of this probability with respect to $\tau$ yields the maximum
likelihood estimate; see Section 3.1.4. Unfortunately, this solution
is not practical because it involves the inversion of the matrix $C_{z|\tau}$. The
size of $C_{z|\tau}$ is N x N, where N is the number of samples of the registration
(which can easily be in the order of $10^4$).


Principal component analysis 


Economical solutions are attainable by using PCA techniques (Section
7.1.1). If the registration period is sufficiently large, the determinant
$|C_{z|\tau}|$ will not depend on $\tau$. With that, we can safely ignore this factor.
What remains is the maximization of the argument of the exponential:

$$ \Lambda(z|\tau) = -z^T C_{z|\tau}^{-1} z \qquad (9.15) $$

The functional $\Lambda(z|\tau)$ is a scaled version of the log-likelihood function.
The first computational savings can be achieved if we apply a principal
component analysis to $C_{z|\tau}$. This matrix can be decomposed as follows:

$$ C_{z|\tau} = \sum_{n=0}^{N-1} \lambda_n(\tau)\, u_n(\tau)\, u_n^T(\tau) \qquad (9.16) $$

$\lambda_n(\tau)$ and $u_n(\tau)$ are eigenvalues and eigenvectors of $C_{z|\tau}$. Using (9.16) the
expression for $\Lambda(z|\tau)$ can be moulded into the following equivalent form:

$$ \Lambda(z|\tau) = -z^T \left(\sum_{n=0}^{N-1} \frac{u_n(\tau)\, u_n^T(\tau)}{\lambda_n(\tau)}\right) z = -\sum_{n=0}^{N-1} \frac{(z^T u_n(\tau))^2}{\lambda_n(\tau)} \qquad (9.17) $$

The computational savings are obtained by discarding all terms in (9.17) that
do not capture much information about the true value of $\tau$. Suppose that $\lambda_n$
and $u_n$ are arranged according to their importance with respect to the
estimation, and that above some value of n, say J, the importance is negligible.
With that, the number of terms in (9.17) reduces from N to J. Experiments
show that J is in the order of 10. A speed up by a factor of 1000 is
feasible.


Selection of good components 


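The problem addressed now is how to order the eigenvectors in (9.17) such
that the most useful components come first, and thus will be selected. The
eigenvectors $u_n(\tau)$ are orthonormal and span the whole space. Therefore:

$$ \sum_{n=0}^{N-1} (z^T u_n(\tau))^2 = \|z\|^2 \qquad (9.18) $$

Adding and subtracting $\|z\|^2/\sigma_v^2$ in (9.17), with (9.18) substituted for one of the two terms, gives:

$$ \Lambda(z|\tau) = -\sum_{n=0}^{N-1} \frac{(z^T u_n(\tau))^2}{\lambda_n(\tau)} + \sum_{n=0}^{N-1} \frac{(z^T u_n(\tau))^2}{\sigma_v^2} - \frac{\|z\|^2}{\sigma_v^2} = \sum_{n=0}^{N-1} \frac{\lambda_n(\tau) - \sigma_v^2}{\lambda_n(\tau)\,\sigma_v^2}\, (z^T u_n(\tau))^2 - \frac{\|z\|^2}{\sigma_v^2} \qquad (9.19) $$

The term containing $\|z\|$ does not depend on $\tau$ and can be omitted. The
maximum likelihood estimate for $\tau$ appears to be equivalent to the one
that maximizes:

$$ \sum_{n=0}^{N-1} \gamma_n(\tau)\, (z^T u_n(\tau))^2 \quad \text{with} \quad \gamma_n(\tau) = \frac{\lambda_n(\tau) - \sigma_v^2}{\lambda_n(\tau)\,\sigma_v^2} \qquad (9.20) $$

The weight $\gamma_n(\tau)$ is a good criterion to measure the importance of an
eigenvector. Hence, a plot of $\gamma_n$ versus n is helpful to find a reasonable
value of J such that $J \ll N$. Hopefully, $\gamma_n$ is large for the first few n, and
then drops down rapidly to zero.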
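The selection of components can be illustrated with a small Python sketch. It assumes the weight $\gamma_n = (\lambda_n - \sigma_v^2)/(\lambda_n \sigma_v^2)$, which follows from rewriting $-\sum_n (z^T u_n)^2/\lambda_n$ with the help of $\sum_n (z^T u_n)^2 = \|z\|^2$. The eigenvalues are taken as given; the function names and the relative cut-off `rel_tol` are assumptions introduced for this example only.

```python
def component_weights(eigenvalues, sigma_v2):
    """Weights gamma_n = (lambda_n - sigma_v^2) / (lambda_n * sigma_v^2)."""
    return [(lam - sigma_v2) / (lam * sigma_v2) for lam in eigenvalues]

def select_components(eigenvalues, sigma_v2, rel_tol=0.01):
    """Indices of the components ordered by decreasing weight, keeping only
    those whose weight is at least rel_tol times the largest weight."""
    gamma = component_weights(eigenvalues, sigma_v2)
    order = sorted(range(len(gamma)), key=lambda n: gamma[n], reverse=True)
    g_max = gamma[order[0]]
    return [n for n in order if gamma[n] >= rel_tol * g_max]

# Toy spectrum: two strong components, the rest at (or near) the noise floor.
eigenvalues = [1.0, 101.0, 1.0, 11.0, 1.001]
selected = select_components(eigenvalues, sigma_v2=1.0)
```

Components whose eigenvalue equals the noise variance get a zero weight, which is why they can be discarded without loss.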


The computational structure of the estimator based on a covariance model 


A straightforward implementation of (9.20) is not very practical. The
expression must be evaluated for varying values of $\tau$. Since the dimension
of $C_{z|\tau}$ is large, this is not computationally feasible.

The problem will be tackled as follows. First, we define a moving
window for the measurements $z_n$. The window starts at n = i and ends at
n = i + I - 1. Thus, it comprises I samples. We stack these samples into
a vector x(i) with elements $x_n(i) = z_{n+i}$. Each value of i corresponds to a
hypothesized value $\tau = i\Delta$. Thus, under this hypothesis, the vector x(i)
contains the direct response with $\tau = 0$, i.e. $x_n(i) = a\,h(n\Delta) +
a\,r(n\Delta) + v(n\Delta)$. Instead of applying operation (9.20) for varying $\tau$, $\tau$ is
fixed to zero and z is replaced by the moving window x(i):

$$ y(i) = \sum_{n=0}^{J-1} \gamma_n(0)\, \left(x(i)^T u_n(0)\right)^2 \qquad (9.21) $$

If $\hat{i}$ is the index that maximizes y(i), then the estimate for $\tau$ is found as
$\hat{\tau}_{cvm} = \hat{i}\Delta$.

The computational structure of the estimator is shown in Figure 9.9.
It consists of a parallel bank of J filters/correlators, one for each eigenvector
$u_n(0)$. The results of that are squared, multiplied by weight factors
$\gamma_n(0)$ and then accumulated to yield the signal y(i). It can be proven that
if we set $\sigma_d^2 = 0$, i.e. a model without reflection, the estimator degenerates
to the classical matched filter.
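The bank of weighted, squared correlators can be sketched as follows. The Python fragment is a direct time-domain illustration of (9.21); it assumes that the J filters (the eigenvectors for zero delay) and their weights have been computed beforehand, and the function name is an assumption.

```python
def cvm_estimator(z, filters, weights, delta):
    """Evaluate y(i) = sum_n weights[n] * (x(i)^T filters[n])^2, where
    x(i) = z[i : i + I] is the moving window of I samples.
    Returns (i_hat * delta, y) with i_hat the index of the maximum of y."""
    window = len(filters[0])
    y = []
    for i in range(len(z) - window + 1):
        x = z[i:i + window]
        y.append(sum(g * sum(a * b for a, b in zip(x, u)) ** 2
                     for g, u in zip(weights, filters)))
    i_hat = max(range(len(y)), key=lambda i: y[i])
    return i_hat * delta, y

# Toy example with J = 2 orthonormal filters of length I = 3.
filters = [[0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
weights = [1.0, 0.5]
z = [0.0, 0.0, 0.0, 3.0, 4.0, 1.0, 0.0, 0.0]
tof, y = cvm_estimator(z, filters, weights, delta=0.5)
```

With a single filter and a unit weight the structure reduces to a squared correlator, in line with the remark that the estimator degenerates to the matched filter when no reflections are modelled.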
Figure 9.9 ML estimation based on covariance models

The design parameters of the estimator are the SNR $\sigma_a^2/\sigma_v^2$, the
duration of echo generation T, the echo strength $S_r = K\sigma_d^2$, the number
of correlators J and the window size I.


Example 


In the following example, the selected design parameters are SNR =
100, T = 0.8 (ms), $S_r = 0.2$ and I = 1000. Using (9.11) and (9.13), we
calculate $C_{z|\tau}$, and from that the eigenvectors and corresponding eigenvalues
and weights are obtained. Figure 9.10 shows the result. As
expected the response of the first filter/correlator is similar to the direct
response, and this part just implements the conventional matched filter.
From the seventh filter on, the weights decrease, and from the fifteenth
filter on the weights are near zero. Thus, the useful number of filters is
between 7 and 15. Figure 9.11 shows the results of application of the
filters to an observed waveform.


9.2.6 Optimization and evaluation 


In order to find the estimator with the best performance the best par- 
ameters of the estimators must be determined. The next step then is to 
assess their performances. 


Cross-validation 


In order to prevent overfitting we apply a three-fold cross-validation 
procedure to the data set consisting of 150 records of waveforms. The 
corresponding MATLAB code is given in Listing 9.8. Here, it is assumed 
that the estimator under test is realized in a MATLAB function called 
ToF_estimator(). 

The optimization of the operator using a training set occurs according 
to the procedure depicted in Figure 9.12. In Listing 9.8 this is imple- 
mented by calling the function opt_ToF_estimator(). The actual 
code for the optimization, given in Listing 9.9, uses the MATLAB function 
fminsearch(). 

We assume here that the operator has a bias that can be compensated
for. The value of the compensation is just another parameter of the
estimator. However, for its optimization the use of the function
fminsearch() is not needed. This would unnecessarily increase the
search space of the parameters. Instead, we simply use the (estimated)
variance as the criterion to optimize, thereby ignoring a possible bias for
a moment. As soon as the optimal set of parameters has been found, the
corresponding bias is estimated afterwards by applying the optimized
estimator once again to the learning set.

Note, however, that the uncertainty in the estimated bias causes a
residual bias in the compensated ToF estimate. Thus, the compensation
of the bias does not imply that the estimator is necessarily unbiased.
Therefore, the evaluation of the estimator should not be restricted to
assessment of the variance alone.

Figure 9.10 Eigenvalues, weights and filter responses of the covariance model
based estimator

Figure 9.11 Results of the estimator based on covariance models


Listing 9.8 
MATLAB listing for cross-validation. 


    load tofdata.mat;            % Load ToF data set containing 150 waveforms
    Npart = 3;                   % Number of partitions
    Nchunk = 50;                 % Number of waveforms in one partition

    % Create 3 random partitions of the data set
    p = randperm(Npart*Nchunk);  % Find random permutation of 1:150
    for n = 1:Npart
        for i = 1:Nchunk
            Zp{n,i} = Zraw{p((n-1)*Nchunk+i)};
            Tp(n,i) = TOFindex(p((n-1)*Nchunk+i));
        end
    end

    % Cross-validation
    for n = 1:Npart
        % Create a learn set and an evaluation set
        Zlearn = Zp;
        Tlearn = Tp;
        for i = 1:Npart
            if (i == n)
                Zlearn(i,:) = []; Tlearn(i,:) = [];
                Zeval = Zp(i,:); Teval = Tp(i,:);
            end
        end
        Zlearn = reshape(Zlearn,(Npart-1)*Nchunk,1);
        Tlearn = reshape(Tlearn,1,(Npart-1)*Nchunk);
        Zeval = reshape(Zeval,1,Nchunk);

        % Optimize a ToF estimator
        [parm,learn_variance,learn_bias] = ...
            opt_ToF_estimator(Zlearn,Tlearn);

        % Evaluate the estimator
        for i = 1:Nchunk
            index(i) = ToF_estimator(Zeval{i},parm);
        end
        variance(n) = var(Teval-index);
        bias(n) = mean(Teval-index-learn_bias);
    end


Figure 9.12 
Listing 9.9
MATLAB listing for the optimization of a ToF estimator.


    function [parm,variance,bias] = ...
        opt_ToF_estimator(Zlearn,Tlearn)
    % Optimize the parameters of a ToF estimator
    parm = [0.15, 543, 1032];    % Initial parameters
    parm = fminsearch(@objective,parm,[],Zlearn,Tlearn);
    [variance,bias] = objective(parm,Zlearn,Tlearn);
    return;

    % Objective function:
    % estimates the variance (and bias) of the estimator
    function [variance,bias] = objective(parm,Z,TOFindex)
    for i = 1:length(Z)
        index(i) = ToF_estimator(Z{i},parm);
    end
    variance = var(TOFindex - index);
    bias = mean(TOFindex - index);
    return


Results 


Table 9.2 shows the results obtained from the cross-validation. The first
row of the table gives the variances directly obtained from the training
data during optimization. They are obtained by averaging over the three
variances of the three partitions. The second row tabulates the variances
obtained from the evaluation data (also obtained by averaging over the
three partitions). Inspection of these results reveals that the threshold
method is overfitted. The reason for this is that some of the records have
a very low signal-to-noise ratio. If, by chance, these records do not
occur in the training set, then the threshold level will be set too low for
these noisy waveforms.


Table 9.2 Results of three-fold cross-validation of the four ToF estimators

                                 Envelope       Curve       Matched     CVM based
                                 thresholding   fitting     filtering   estimator
    Variance (learn data) (μs²)  356            378         14339       572
    Variance (test data) (μs²)   181010         384         14379       591
    Corrected variance (μs²)     180910         284         14279       491
    Bias (μs)                    35             2           10          2
    RMS (μs)                     427 ± 25       17 ± 1.2    119 ± 7     22 ± 1.4


The reference values of the ToFs in the data set have an uncertainty of
10 (μs). Therefore, the variances estimated from the evaluation sets have
a bias of 100 (μs²). The third row in the table shows the variances after
correction for this effect.

Another error source to account for is the residual bias. This error can
be assessed as follows. The statistical fluctuations due to the finite data
set cause uncertainty in the estimated bias. Suppose that $\sigma^2$ is the
variance of a ToF estimator. Then, according to (5.11), the bias estimated
from a training set of $N_S$ samples has a variance of $\sigma^2/N_S$. The
residual bias has an order of magnitude of $\sigma/\sqrt{N_S}$. In the present case, $\sigma^2$
includes the variance due to the uncertainty in the reference values of the
ToFs. The calculated residual biases are given in the fourth row in Table
9.2.

The final evaluation criterion must include both the variance and the
bias. A suitable criterion is the mean square error, or equivalently the
root mean square error (RMS), defined as $\mathrm{RMS} = \sqrt{\text{variance} + \text{bias}^2}$.
The calculated RMSs are given in the fifth row.
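The correction of the variance and the computation of the RMS amount to simple arithmetic. The short Python sketch below reproduces it for the curve-fitting column of Table 9.2; the function names are assumptions made for the illustration.

```python
import math

def corrected_variance(test_variance, reference_std):
    """Remove the variance contributed by the uncertain reference ToFs."""
    return test_variance - reference_std ** 2

def rms_error(variance, bias):
    """Root mean square error: sqrt(variance + bias^2)."""
    return math.sqrt(variance + bias ** 2)

# Curve fitting: test variance 384 (us^2), reference uncertainty 10 (us),
# residual bias 2 (us).
var_curve = corrected_variance(384.0, 10.0)   # 284 (us^2)
rms_curve = rms_error(var_curve, 2.0)         # about 17 (us)
```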


Discussion 


Examination of Table 9.2 reveals that the four methods are ranked
according to: curve fitting, CVM (covariance model) based, matched filtering
and finally thresholding. The performance of the threshold method is far behind the
other methods. The reason for the failure of the threshold method is a
lack of robustness. The method fails for a few samples in the training set.
The robustness is simply improved by increasing the relative threshold,
but at the cost of a larger variance. For instance, if the threshold is raised
to 50%, the variance becomes around 900 (μs²).

The poor performance of the matched filter is due to the fact that it
is not able to cope with the echoes. The covariance model clearly
helps to overcome this problem. The performance of the CVM
method is just below that of curve fitting. Apparently, the covariance
model is an important improvement over the simple white noise model
of the matched filter, but still too inaccurate to beat the curve-fitting
method.

One might argue that the difference between the performances of
curve fitting and CVM is not a statistically significant one. The dominant
factor in the uncertainty of the RMSs is brought forth by the
estimated variance. Assuming Gaussian distributions for the randomness,
the variance of the estimation error of the variance is given by
(5.13):

$$ \mathrm{Var}[\hat{\sigma}^2] = \frac{2\sigma^4}{N_S} \qquad (9.22) $$

where $N_S = 150$ is the number of samples in the data set. This error
propagates into the RMS with a standard deviation of $\sigma/\sqrt{2N_S}$. The
corresponding margins are shown in Table 9.2. It can be seen that the
difference between the performances is larger than the margins, thus
invalidating the argument.
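The margin $\sigma/\sqrt{2N_S}$ can be evaluated as in the following Python sketch. Which value of $\sigma$ was used for each column is not stated in the text; the example assumes the square root of the test-data variance of the CVM based estimator, and the function name is an assumption as well.

```python
import math

def rms_margin(sigma, n_samples):
    """Standard deviation of the RMS that results from the uncertainty
    of the estimated variance: sigma / sqrt(2 * n_samples)."""
    return sigma / math.sqrt(2 * n_samples)

# CVM based estimator: test variance 591 (us^2), 150 records.
margin_cvm = rms_margin(math.sqrt(591.0), 150)   # about 1.4 (us)
```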

Another aspect of the design is the computational complexity of the
methods. The threshold method and the curve-fitting method both
depend on the availability of the envelope of the waveform. The quadrature
filtering that is applied to calculate the envelope requires the
application of the Fourier transform. The waveforms in the data set are
windowed to N = 8192 samples. Since N is a power of two, MATLAB's
implementations of the fast Fourier transforms, fft() and ifft(),
have a complexity of $2N \log_2 N$. The threshold method does not need
further substantial computational effort. The curve-fitting method
needs an additional numerical optimization, but since the number of
points of the curve is not large, such an optimization is not very
expensive.

A Fourier-based implementation of a correlator also has a complexity
of $2N \log_2 N$. Thus, the computational cost of the matched filter
is in the same order as the envelope detector. The CVM based method
uses J correlators. Its complexity is $(J + 1)N \log_2 N$. Since typically
J = 10, the CVM method is about 10 times more expensive than the
other methods.

In conclusion, the most accurate method for ToF estimation appears
to be the curve-fitting method with a computational complexity that is in
the order of $2N \log_2 N$. With this method, ToF estimation with an
uncertainty of about 17 (μs) is feasible. The maximum likelihood estimator
based on a covariance model follows the curve-fitting method
closely with an uncertainty of 22 (μs). Additional modelling of the
occurrence of echoes is needed to improve its performance.
9.3 ONLINE LEVEL ESTIMATION IN AN HYDRAULIC SYSTEM


This third example considers the hydraulic system already introduced in
Section 8.1, where it illustrated some techniques for system identification.
Figure 8.2 shows an overview of the system. The goal in the present
section is to design an online state estimator for the two levels $h_1(t)$ and
$h_2(t)$, and for the input flow $q_0$.

The model that turned out to be best is Torricelli's model. In discrete
time, with $x_1(i) = h_1(i\Delta)$ and $x_2(i) = h_2(i\Delta)$, the model of (8.3)
becomes:

$$ \begin{bmatrix} x_1(i+1) \\ x_2(i+1) \end{bmatrix} = \begin{bmatrix} x_1(i) \\ x_2(i) \end{bmatrix} + \frac{\Delta}{C} \begin{bmatrix} -\sqrt{R_1 (x_1(i) - x_2(i))} + q_0(i) \\ \sqrt{R_1 (x_1(i) - x_2(i))} - \sqrt{R_2\, x_2(i)} \end{bmatrix} \qquad (9.23) $$

$R_1$ and $R_2$ are two constants that depend on the areas of the cross-sections
of the pipelines of the two tanks. C is the capacity of the two
tanks. These parameters are determined during the system identification.
$\Delta = 5$ (s) is the sampling period.

The system in (9.23) is observable with only one level sensor. However,
in order to allow consistency checks (which are needed to optimize and to
evaluate the design) the system is provided with two level sensors. The
redundancy of sensors is only needed during the design phase, because once
a consistently working estimator has been found, one level sensor suffices.

The experimental data that is available for the design is a record of the
measured levels as shown in Figure 9.13. The levels are obtained by
means of pressure measurements using the Motorola MPX2010 pressure
sensor. Some of the specifications of this sensor are given in Table 9.3.
The full-scale span $V_{FSS}$ corresponds to $P_{max} = 10$ kPa. In turn,
this maximum pressure corresponds to a level of $h_{max} = P_{max}/(\rho g)
\approx 1000$ (cm). Therefore, the linearity is specified between -10 (cm)
and +10 (cm). This specification is for the full range. In our case, the
swing of the levels is limited to 20 (cm), and the measurement system
was calibrated at 0 (cm) and 25 (cm). Therefore, the linearity error will be
much less. The pressure hysteresis is an error that depends on whether
the pressure is increasing or decreasing. The pressure hysteresis can induce
a maximal level error between -1 (cm) and +1 (cm). Besides these
sensor errors, the measurements are also contaminated by electronic noise
modelled by white noise v(i) with a standard deviation of
$\sigma_v = 0.04$ (cm). With that, the model of the sensory system becomes:

$$ \begin{bmatrix} z_1(i) \\ z_2(i) \end{bmatrix} = \begin{bmatrix} x_1(i) \\ x_2(i) \end{bmatrix} + \begin{bmatrix} e_1(i) \\ e_2(i) \end{bmatrix} + \begin{bmatrix} v_1(i) \\ v_2(i) \end{bmatrix} \qquad (9.24) $$

or more concisely: z(i) = Hx(i) + e(i) + v(i) with H = I. The error e(i)
represents the linearity and hysteresis error.

Figure 9.13 Measured levels of two interconnected tanks


Table 9.3 Specifications of the MPX2010 pressure sensor

    Characteristic        Symbol    Min     Typ     Max    Unit
    Pressure range        P         0       -       10     kPa
    Full-scale span       V_FSS     24      25      26     mV
    Sensitivity           ΔV/ΔP     -       2.5     -      mV/kPa
    Linearity error       -         -1.0    -       1.0    %V_FSS
    Pressure hysteresis   -         -       ±0.1    -      %V_FSS

In the next sections, the linearized Kalman filter, the extended Kalman 
filter and the particle filter will be examined. In all three cases a model is 
needed for the input flow qo(i). The linearized Kalman filter can only 
handle linear models. The extended Kalman filter can handle nonlinear 
models, but only if the nonlinearities are smooth. Particle filtering offers 
the largest freedom of modelling. All three models need parameter 
estimation in order to adapt the model to the data. Consistency checks 
must indicate which estimator is most suitable. 
The prior knowledge that we assume is that the tanks at i = 0 are
empty.


9.3.1 Linearized Kalman filtering 


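The simplest dynamic model for $q_0(i)$ is the first order AR model:

$$ q_0(i+1) = \bar{q}_0 + \alpha\,(q_0(i) - \bar{q}_0) + w(i) \qquad (9.25) $$

$\bar{q}_0$ is a constant input flow that maintains the equilibrium in the tanks.
The Gaussian process noise w(i) causes random perturbations around
this equilibrium. The factor $\alpha$ regulates the bandwidth of the fluctuations
of the input flow. The design parameters are $\bar{q}_0$, $\sigma_w$ and $\alpha$.

The augmented state vectors are $x(i) = [\,x_1(i) \;\; x_2(i) \;\; q_0(i)\,]^T$. Equations
(9.23) and (9.25) make up the augmented state equation
x(i + 1) = f(x(i), w(i)), which is clearly nonlinear. The equilibrium follows
from equating $\bar{x} = f(\bar{x}, 0)$:

$$ \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \end{bmatrix} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \end{bmatrix} + \frac{\Delta}{C} \begin{bmatrix} -\sqrt{R_1 (\bar{x}_1 - \bar{x}_2)} + \bar{q}_0 \\ \sqrt{R_1 (\bar{x}_1 - \bar{x}_2)} - \sqrt{R_2\, \bar{x}_2} \end{bmatrix} \;\Rightarrow\; \begin{cases} \bar{x}_2 = \bar{q}_0^2 / R_2 \\ \bar{x}_1 = \dfrac{R_1 + R_2}{R_1}\, \bar{x}_2 \end{cases} \qquad (9.26) $$

The next step is to calculate the Jacobian matrix of f():

$$ F(x) = \frac{\partial f}{\partial x} = \begin{bmatrix} 1 - \dfrac{\Delta R_1}{2C\sqrt{R_1 (x_1 - x_2)}} & \dfrac{\Delta R_1}{2C\sqrt{R_1 (x_1 - x_2)}} & \dfrac{\Delta}{C} \\ \dfrac{\Delta R_1}{2C\sqrt{R_1 (x_1 - x_2)}} & 1 - \dfrac{\Delta R_1}{2C\sqrt{R_1 (x_1 - x_2)}} - \dfrac{\Delta R_2}{2C\sqrt{R_2 x_2}} & 0 \\ 0 & 0 & \alpha \end{bmatrix} \qquad (9.27) $$

The linearized model arises by application of a truncated Taylor series
expansion around the equilibrium:

$$ x(i+1) = \bar{x} + F(\bar{x})\,(x(i) - \bar{x}) + G\,w(i) \qquad (9.28) $$

with $G = [\,0 \;\; 0 \;\; 1\,]^T$.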
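The augmented model, its equilibrium and its Jacobian can be written down compactly. The Python sketch below assumes Torricelli's discrete-time model with outflow terms $\sqrt{R_1(x_1 - x_2)}$ and $\sqrt{R_2 x_2}$ and the first order AR model for the input flow; the function names and all parameter values used in the example are hypothetical and are not the identified values of the real system.

```python
import math

def state_step(x, w, R1, R2, C, delta, alpha, q_bar):
    """One step x(i+1) = f(x(i), w(i)) of the augmented state [x1, x2, q0].
    Requires x1 > x2 > 0."""
    x1, x2, q0 = x
    s12 = math.sqrt(R1 * (x1 - x2))     # flow from tank 1 to tank 2
    s2 = math.sqrt(R2 * x2)             # outflow of tank 2
    return [x1 + delta / C * (-s12 + q0),
            x2 + delta / C * (s12 - s2),
            q_bar + alpha * (q0 - q_bar) + w]

def equilibrium(q_bar, R1, R2):
    """State for which both flows equal the constant input flow q_bar."""
    x2 = q_bar ** 2 / R2
    x1 = (R1 + R2) / R1 * x2
    return [x1, x2, q_bar]

def jacobian(x, R1, R2, C, delta, alpha):
    """Jacobian matrix of state_step with respect to the state x."""
    x1, x2, _ = x
    a = delta * R1 / (2 * C * math.sqrt(R1 * (x1 - x2)))
    b = delta * R2 / (2 * C * math.sqrt(R2 * x2))
    return [[1 - a, a, delta / C],
            [a, 1 - a - b, 0.0],
            [0.0, 0.0, alpha]]

# Hypothetical parameters, for illustration only.
params = dict(R1=100.0, R2=400.0, C=500.0, delta=5.0)
x_bar = equilibrium(20.0, params["R1"], params["R2"])
F_bar = jacobian(x_bar, alpha=0.95, **params)
```

Evaluating `state_step` at `x_bar` with w = 0 returns `x_bar` again, which is a quick check that the equilibrium and the state equation are mutually consistent.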
The three unknown parameters $\bar{q}_0$, $\alpha$ and $\sigma_w$ must be estimated
from the data. Various criteria can be used to find these parameters,
but in most of these criteria the innovations play an important role.
For a proper design the innovations are a white Gaussian sequence of
random vectors with zero mean and known covariance matrix, i.e. the
innovation matrix. See Section 8.4. In the sequel we will use the NIS
(normalized innovations squared; Section 8.4.2). For a consistent
design, the NIS must have a $\chi^2_N$ distribution. In the present case
N = 2. The expectation of a $\chi^2_N$-distributed variable is N. Thus, here,
ideally the mean of the NIS is 2. A simple criterion is one which
drives the average of the calculated NIS to 2. A possibility is:

$$ J(\bar{q}_0, \alpha, \sigma_w) = \left| \frac{1}{I} \sum_{i=0}^{I-1} Nis(i) - 2 \right| \qquad (9.29) $$

The values of $\bar{q}_0$, $\alpha$ and $\sigma_w$ that minimize J are considered optimal.

The strategy is to realize a MATLAB function J = flinDKF(y) that
implements the linearized Kalman filter, applies it to the data, calculates
Nis(i) and returns J according to eq. (9.29). The input argument y is an
array containing the variables $\bar{q}_0$, $\alpha$ and $\sigma_w$. Their optimal values are
found by one of MATLAB's optimization functions, e.g. [parm,J] =
fminsearch(@flinDKF, [20 0.95 5]).
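The criterion itself is straightforward to compute once the filter has produced its innovations and innovation matrices. The Python sketch below assumes the usual definition of the NIS as the quadratic form of the innovation with the inverse innovation matrix, restricted to the two-dimensional case of this example; the function names are assumptions, and the Kalman filter that delivers the inputs is not shown.

```python
def nis(innovation, S):
    """Normalized innovation squared e^T S^{-1} e for a 2-dimensional
    innovation e and a 2 x 2 innovation matrix S."""
    (a, b), (c, d) = S
    det = a * d - b * c
    e1, e2 = innovation
    # S^{-1} = 1/det * [[d, -b], [-c, a]]
    return (e1 * (d * e1 - b * e2) + e2 * (-c * e1 + a * e2)) / det

def consistency_criterion(innovations, innovation_matrices, dim=2):
    """J = | average NIS - dim |, to be minimized over the design parameters."""
    values = [nis(e, S) for e, S in zip(innovations, innovation_matrices)]
    return abs(sum(values) / len(values) - dim)

# Toy input: two innovations with unit innovation matrices.
identity = [[1.0, 0.0], [0.0, 1.0]]
J = consistency_criterion([(1.0, 1.0), (1.0, -1.0)], [identity, identity])
```

A value of J close to zero only shows that the average of the NIS is right; it does not guarantee that the NIS follows the required distribution or that the innovations are white.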

Application of this strategy revealed an unexpected phenomenon. The
solution $(\bar{q}_0, \alpha, \sigma_w)$ that minimizes J is not unique. The values of $\alpha$ and
$\sigma_w$ do not influence J as long as they are both sufficiently large. For
instance, Figure 9.14 shows the results for $\alpha = 0.98$ and $\sigma_w = 1000$
(cm³/s). But any value of $\alpha$ and $\sigma_w$ above this limit gives virtually the
same results and the same minimum. The minimum obtained is J = 25.6.
This is far too large for a consistent solution. Also the NIS, shown in
Figure 9.14, does not obey the statistics of a $\chi^2_2$ distribution. During the
transient, in the first 200 seconds, the NIS reaches extreme levels indicating
that some unmodelled phenomena occur there. But also during
the remaining part of the process the NIS shows some unwanted high
peaks. Clearly the estimator is not consistent.

Three modelling errors might explain the anomalous behaviour of the
linearized Kalman filter. First, the linearization of the system equation
might fail. Second, the AR model of the input flow might be inappropriate.
Third, ignoring possible linearity errors of the sensors might not be
permissible. In the next two sections, the first two possible
explanations will be examined.








[Figure: estimated levels (solid) and measurements (dotted) (cm); estimated input flow (cm^3/s); horizontal axis t (s), 0 to 3500]

Figure 9.14 Results from the linearized Kalman filter


9.3.2 Extended Kalman filtering 


Linearization errors of the system equation are expected to be influential 
when the levels deviate largely from the equilibrium state. The abnor- 
mality during the transient in Figure 9.14 might be caused by this kind of 
error because there the nonlinearity is strongest. The extended Kalman 
filter is able to cope with smooth linearity errors. Therefore, this method 
might give an improvement. 

In order to examine this option, a MATLAB function J = fextDKF(y)
was realized that implements the extended Kalman filter. Again, y is an
array containing the variables $\bar{q}_0$, $\alpha$ and $\sigma_w$. Application to the data,
calculation of J and minimization with fminsearch() yields estimates
as shown in Figure 9.15.

The NIS of the extended Kalman filter, compared with that of the
linearized Kalman filter, is much better now. The optimization criterion
J attains a minimal value of 3.6 instead of 25.6; a significant
improvement has been reached. However, the NIS still doesn’t obey a
$\chi^2_2$ distribution. Also, the optimization of the parameters $\bar{q}_0$, $\alpha$ and $\sigma_w$
is not without troubles. Again, the minimum is not unique. Any solution
of $\alpha$ and $\sigma_w$ suffices as long as both parameters are sufficiently
large. Moreover, it now appears that the choice of $\bar{q}_0$ does not have any
influence at all.

Figure 9.15 Results from the extended Kalman filter

The explanation of this behaviour is as follows. First of all, we
observe that in the prediction step of the Kalman filter a large value of
$\sigma_w$ induces a large uncertainty in the predicted input flow. As a result,
the prediction of the level in the first tank is also very uncertain.
In the next update step, the corresponding Kalman gain will be close
to one, and the estimated level in the first tank will closely follow the
measurement, $\hat{x}_1(i|i) \approx z_1(i)$. If $\alpha$ is sufficiently large, so that the
autocorrelation in the sequence $q_0(i)$ is large, the estimate $\hat{q}_0(i|i)$ is
derived from the differences of succeeding samples $\hat{x}_1(i|i)$ and
$\hat{x}_1(i-1|i-1)$. Since $\hat{x}_1(i|i) \approx z_1(i)$ and $\hat{x}_1(i-1|i-1) \approx z_1(i-1)$, the
estimate $\hat{q}_0(i|i)$ only depends on measurements. The value of $\bar{q}_0$ does
not influence that. The conclusion is that the AR model does not
provide useful prior knowledge for the input flow.
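
The effect of a large $\sigma_w$ on the Kalman gain can be made plausible with a scalar sketch. The list of $\sigma_w$ values, and the simplification that the process noise directly determines the predicted variance of a directly measured state, are assumptions for illustration only; the sketch does not reproduce the three-dimensional filter of this section.

```matlab
% Scalar illustration (assumed values): predicted variance P of a
% directly measured state (H = 1) and sensor noise variance Cv.
sigma_v = 0.04;                  % sensor noise std (cm)
Cv      = sigma_v^2;
sigma_w = [0.01 0.1 1 10 100];   % assumed process noise std values
P       = sigma_w.^2;            % predicted variance dominated by the noise
K       = P ./ (P + Cv);         % scalar Kalman gain

% K approaches 1 as sigma_w grows, so the updated estimate
% xhat = xpred + K*(z - xpred) approaches the measurement z.
disp([sigma_w; K]');
```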


9.3.3 Particle filtering 


A visual inspection of the estimated input flow in Figure 9.15 indeed
reveals that an AR model does not fit well. The characteristic of the
input flow is more like that of a random binary signal modelled by
two discrete states $\{q_{min}, q_{max}\}$, and a transition probability
$P_t(q_0(i)|q_0(i-1))$; see Section 4.3.1. The estimated flow in Figure 9.15
also suggests that $q_{min} = 0$. This actually models an on/off control
mechanism of the input flow. With this assumption, the transition
probability is fully defined by two probabilities:

$$ \begin{aligned} P_{up} &= P_t(q_0(i) = q_{max} \mid q_0(i-1) = 0) \\ P_{down} &= P_t(q_0(i) = 0 \mid q_0(i-1) = q_{max}) \end{aligned} $$    (9.30)

The unknown parameters of the new flow model are $q_{max}$, $P_{up}$ and
$P_{down}$. These parameters must be retrieved from the data by optimizing
some consistency criterion of an estimator. Kalman filters cannot cope
with binary signals. Therefore, we now focus on particle filtering,
because this method is able to handle discrete variables (see Section 4.4).

Again, the strategy is to realize a MATLAB function J = fpf(y) that
implements a particle filter, applies it to the data and returns a criterion J.
The input argument y contains the design parameters $q_{max}$, $P_{up}$ and
$P_{down}$. The output J must express how consistent the result of the
particle filter is. Minimization of this criterion gives the best
attainable result.

An important issue is how to define the criterion J. The NIS, which
was previously used, exists within the framework of Kalman filters, i.e.
for linear-Gaussian systems. It is not trivial to find a concept within the
framework of particle filters that is equivalent to the NIS. The general
idea is as follows. Suppose that, using all previous measurements $Z(i-1)$
up to time $i-1$, the probability density of the state $x(i)$ is
$p(x(i)|Z(i-1))$. Then, the probability of $z(i)$ is

$$ p(z(i)|Z(i-1)) = \int_x p(z(i), x(i)|Z(i-1))\,dx = \int_x p(z(i)|x(i))\,p(x(i)|Z(i-1))\,dx $$    (9.31)

The density $p(z(i)|x(i))$ is simply the model of the sensory system and as
such known. The probability $p(x(i)|Z(i-1))$ is represented by the
predicted samples. Therefore, using (9.31) the probability $p(z(i)|Z(i-1))$
can be calculated. The filter is consistent only if the sequence of observed
measurements $z(i)$ obeys the statistics prescribed by the sequence of
densities $p(z(i)|Z(i-1))$.

A test of whether all $z(i)$ comply with $p(z(i)|Z(i-1))$ is not easy, because
$p(z(i)|Z(i-1))$ depends on i. The problem will be tackled by treating each
scalar measurement separately. We consider the n-th element $z_n(i)$ of the
measurement vector, and assume that $p_n(z,i)$ is its hypothesized marginal
probability density. Suppose that the cumulative distribution of $z_n(i)$ is
$F_n(z,i) = \int_{-\infty}^{z} p_n(\zeta,i)\,d\zeta$. Then the random variable $u_n(i) = F_n(z_n(i),i)$ has
a uniform distribution between 0 and 1. The consistency check boils down
to testing whether the set $\{u_n(i) \mid i = 0,\ldots,I-1 \text{ and } n = 1,\ldots,N\}$ indeed
has such a uniform distribution.

In the literature, the statistical test of whether a set of variables has a 
given distribution function is called a goodness-of-fit test. There are 
various methods for performing such a test. A particular one is the chi- 
square test and is as follows (Kreyszig, 1970): 


Algorithm 9.1: Goodness of fit (chi-square test for the uniform
distribution)

Input: a set of variables that are within the interval [0, 1]. The size of the
set is B.

1. Divide the interval [0,1] into a number of L equally spaced containers
   and count the number of times that the random variable falls in
   the $\ell$-th container. Denote this count by $b_\ell$. Thus, $\sum_{\ell=1}^{L} b_\ell = B$.
   (Note: L must be such that $b_\ell > 5$ for each $\ell$.)

2. Set e = B/L. This is the expected number of variables in one container.

3. Calculate the test variable $J = \sum_{\ell=1}^{L} (b_\ell - e)^2 / e$.

4. If the set is uniformly distributed, then J has a $\chi^2_{L-1}$ distribution.
   Thus, if J is too large to be compatible with the $\chi^2_{L-1}$ distribution, the
   hypothesis of a uniform distribution must be rejected. For instance, if
   L = 20, the probability that J > 43.8 equals 0.001.
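
The steps of Algorithm 9.1 can be sketched in a few lines of MATLAB. The set size, the synthetic data and all variable names below are assumptions for illustration; only the threshold 43.8 for L = 20 is taken from the text.

```matlab
B = 2000;  L = 20;                 % assumed set size and number of containers
u = rand(B, 1);                    % synthetic data, uniform on [0,1]

idx = min(floor(u * L) + 1, L);    % container index 1..L (u = 1 goes to L)
b   = accumarray(idx, 1, [L 1]);   % step 1: counts per container
e   = B / L;                       % step 2: expected count per container
J   = sum((b - e).^2 / e);         % step 3: test variable

% Step 4: compare J with a threshold of the chi-square distribution
% with L-1 degrees of freedom (43.8 for L = 20 at probability 0.001).
reject = (J > 43.8);

% A clearly non-uniform set for comparison (assumption: squared uniforms)
u2   = rand(B, 1).^2;
idx2 = min(floor(u2 * L) + 1, L);
b2   = accumarray(idx2, 1, [L 1]);
J2   = sum((b2 - e).^2 / e);       % expected to be far above 43.8
```

For uniformly distributed input the hypothesis is rejected only rarely, whereas the skewed set u2 produces a test variable far above the threshold.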


The particular particle filter that was implemented is based on the 
condensation algorithm described in Section 4.4.3. The MATLAB code 
is given in Listing 9.10. Some details of the implementation follow next. 
See Algorithm 4.4. 


• The prior knowledge that we assume is that at i = 0 both tanks are
  empty, that is $x_1(0) = x_2(0) = 0$. The probability that the input
  flow is ‘on’ or ‘off’ is 50/50.

• In the update step, the importance weights should be set to
  $w^{(k)} = p(z(i)|x^{(k)})$. Assuming a Gaussian distribution of the
  measurement errors, the weights are calculated as
  $w^{(k)} = \exp\left(-\tfrac{1}{2}(z(i) - Hx^{(k)})^T C_v^{-1} (z(i) - Hx^{(k)})\right)$.
  The normalizing constant of the Gaussian can be ignored because the
  weights are going to be normalized anyway.

• ‘Finding the smallest j such that $w_{cum}^{(j)} \ge r^{(k)}$’ is implemented by the
  bisection method of finding a root using the golden rule.

• In the prediction step, ‘finding the samples $x^{(k)}$ drawn from the
  density $p(x(i)|x(i-1) = x_{selected}^{(k)})$’ is done by generating random
  input flow transitions from ‘on’ to ‘off’ and from ‘off’ to ‘on’
  according to the transition probabilities $P_{down}$ and $P_{up}$, and by
  applying these to the state equation (implemented by
  f(Ys,R1,R2)). However, the time-discrete model only allows
  the transitions to occur at exactly the sampling periods $t_i = i\Delta$.
  In the real time-continuous world, the transitions can take place
  anywhere in the time interval between two sampling periods. In
  order to account for this effect, the level of the first tank is
  randomized with a correction term randomly selected between 0
  and $q_{max}\Delta/C$. Such a correction is only needed for samples where
  a transition takes place.

• The conditional mean, approximated by $\hat{x}(i) = \sum_k w^{(k)} x^{(k)} / \sum_k w^{(k)}$,
  is used as the final estimate.

• The probability density $p(z(i)|Z(i-1))$ is represented by samples
  $z^{(k)}$, which are derived from the predicted samples $x^{(k)}$ according to
  the model $z^{(k)} = Hx^{(k)} + v^{(k)}$ where $v^{(k)}$ is Gaussian noise with
  covariance matrix $C_v$. The marginal probabilities $p_n(z,i)$ are then
  represented by the scalar samples $z_n^{(k)}$. The test variables $u_n(i)$ are
  simply obtained by counting the number of samples for which
  $z_n^{(k)} < z_n(i)$ and dividing it by K, the total number of samples.


Listing 9.10 
MATLAB listing of a function implementing the condensation algorithm. 


function J = fpf(y)

load hyddata.mat;                 % Load the dataset (measurements in Z)
I = length(Z);                    % Length of the sequence
R1 = 105.78;                      % Friction constant (cm^5/s^2)
R2 = 84.532;                      % Friction constant (cm^5/s^2)
qmin = 0;                         % Minimal input flow
delta = 5;                        % Sampling period (s)
C = 420;                          % Capacity of tank (cm^2)
sigma_v = 0.04;                   % Standard deviation sensor noise (cm)
Ncond = 1000;                     % Number of particles
M = 3;                            % Dimension of state vector
N = 2;                            % Dimension of measurement vector

% Set the design parameters
qmax = y(1);                      % Maximum input flow
Pup = y(2);                       % Transition probability up
Pdn = y(3);                       % Transition probability down

% Initialisation
hmax = 0; hmin = 0;               % Margins of levels at i=0
H = eye(N,M);                     % Measurement matrix
invCv = inv(sigma_v^2 * eye(N));  % Inv. cov. of sensor noise

% Generate the samples
Xs(1,:) = hmin + (hmax-hmin)*rand(1,Ncond);
Xs(2,:) = hmin + (hmax-hmin)*rand(1,Ncond);
Xs(3,:) = qmin + (qmax-qmin)*(rand(1,Ncond) > 0.5);

for i = 1:I
    % Generate predicted meas. representing p(z(i)|Z(i-1))
    Zs = H*Xs + sigma_v*randn(2,Ncond);

    % Get uniform distributed rv
    u(1,i) = sum((Zs(1,:) < Z(1,i)))/Ncond;
    u(2,i) = sum((Zs(2,:) < Z(2,i)))/Ncond;

    % Update
    res = H*Xs - Z(:,i)*ones(1,Ncond);       % Residuals
    W = exp(-0.5*sum(res.*(invCv*res)))';    % Weights
    if (sum(W)==0), error('process did not converge'); end
    W = W/sum(W); CumW = cumsum(W);
    xest(:,i) = Xs(:,:)*W;                   % Sample mean

    % Find an index permutation using golden rule root finding
    for j = 1:Ncond
        R = rand; ja = 1; jb = Ncond;
        while (ja < jb-1)
            jx = floor(jb - 0.382*(jb-ja));
            fa = R - CumW(ja); fb = R - CumW(jb); fxx = R - CumW(jx);
            if (fb*fxx < 0), ja = jx; else, jb = jx; end
        end
        ind(j) = jb;
    end

    % Resample
    for j = 1:Ncond, Ys(:,j) = Xs(:,ind(j)); end

    % Predict
    Tdn = (rand(1,Ncond) < Pdn);             % Random transitions
    Tup = (rand(1,Ncond) < Pup);             % idem
    kdn = find((Ys(3,:)==qmax) & Tdn);       % Samples going down
    kup = find((Ys(3,:)==qmin) & Tup);       % Samples going up
    Ys(3,kdn) = qmin;                        % Turn input flow off
    Ys(1,kdn) = Ys(1,kdn) + ...              % Randomize level 1
        (qmax-qmin)*delta*rand(1,length(kdn))/C;
    Ys(3,kup) = qmax;                        % Turn input flow on
    Ys(1,kup) = Ys(1,kup) - ...              % Randomize level 1
        (qmax-qmin)*delta*rand(1,length(kup))/C;
    Xs = f(Ys,R1,R2);                        % Update samples
end

e = I/10;                         % Expected number of rv in one bin
% Get histograms (10 bins) and calculate test variables
for i = 1:2, b = hist(u(i,:)); c(i) = sum((b-e).^2/e); end
J = sum(c);                       % Full test (chi-square with 19 DoF)
return


Figure 9.16 shows the results obtained using particle filtering. With the
number of samples set to 1000, minimization of J with respect to
$q_{max}$, $P_{up}$ and $P_{down}$ yielded a value of about J = 1400; a clear indication
that the test variables $u_1(i)$ and $u_2(i)$ are not uniformly distributed
because J should have a $\chi^2_{19}$ distribution. The graph in Figure 9.16,
showing these test variables, confirms this statement. The values of $P_{up}$
and $P_{down}$ that provided the minimum were about 0.3. This minimum
was very flat though.

Figure 9.16 Results from particle filtering

The large transition probabilities that are needed to minimize the
criterion indicate that we still have not found an appropriate model for
the input flow. After all, large transition probabilities mean that the
particle filter has much freedom in selecting an input flow that fits the data.


9.3.4 Discussion 


Up to now we have not succeeded in constructing a consistent estimator.
The linearized Kalman filter failed because of the nonlinearity of the
system, which is simply too severe. The extended Kalman filter was an
improvement, but its behaviour was still not regular. The filter could
only be put into action if the model for the input flow was unrestrictive,
but the NIS of the filter was still too large. The particle filter that we
tested used quite a different model for the input flow than the extended
Kalman filter, yet this filter also needed an unrestrictive model for
the input flow.

At this point, we must face another possibility. We have adopted a
linear model for the measurements, but the specifications of the sensors
mention linearity and hysteresis errors e(i) which can, when measured
over the full-scale span, become quite large. Until now we have ignored
the error e(i) in (9.24), but without really demonstrating that doing so is
allowed. If it is not, that would explain why both the extended Kalman
filter and the particle filter only work with unrestrictive models for
the input signal. In trying to get estimates that fit the data from both
sensors the estimators cannot handle any further restrictions on the input
signal.

The best way to cope with the existence of linearity and hysteresis errors 
is to model them properly. A linearity error — if reproducible — can be 
compensated by a calibration curve. A hysteresis error is more difficult to 
compensate because it depends on the dynamics of the states. In the 
present case, a calibration curve can be deduced for the second sensor 
using the previous results. In the particle filter of Section 9.3.3 the estimate 
of the first level is obtained by closely following the measurements in the 
first tank. The estimates of the second level are obtained merely by using 
the model of the hydraulic system. Therefore, these estimates can be used
as a reference for the measurements in the second tank. 

Figure 9.17 shows a scatter diagram of the errors versus estimated
levels of the second tank. Of course, the data is contaminated by
measurement noise, but nevertheless the trend is appreciable. Using
MATLAB’s polyfit() and polyval(), polynomial regression has
been applied to find a curve that fits the data. The resulting polynomial
model is used to compensate the linearity errors of the second sensor.
The results, illustrated in Figure 9.18, show test variables that are much
more uniformly distributed than the same variables in Figure 9.16. The
chi-square test gives a value of J = 120. This is a significant improvement,
but nevertheless still too much for a consistent filter. However,
Figure 9.18 does show that the compensation of the linearity
errors of the sensors has a large impact.
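
A calibration curve of this kind can be sketched with polyfit() and polyval(). The data below are synthetic, and the variable names, the cubic error shape and the polynomial order are assumptions for illustration; they are not the values behind Figure 9.17.

```matlab
% Synthetic stand-ins (assumptions): reference levels from the model
% and sensor readings with a smooth linearity error plus noise.
xref     = linspace(0, 20, 400)';          % estimated levels (cm)
true_err = 2e-4 * (xref - 10).^3;          % assumed linearity error (cm)
z        = xref + true_err + 0.04*randn(size(xref));

err = z - xref;                  % error versus estimated level
p   = polyfit(xref, err, 3);     % third-order polynomial (assumed order)

% Compensation: subtract the fitted error. The curve is evaluated at the
% raw reading, which is a reasonable approximation for small errors.
zcal = z - polyval(p, z);

rms_before = sqrt(mean((z    - xref).^2));
rms_after  = sqrt(mean((zcal - xref).^2));
```

With these assumed numbers the residual error after compensation is dominated by the measurement noise, so rms_after is smaller than rms_before.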

For the final design, the errors of the first sensor should also be 
compensated. Additional measurements are needed to obtain the cali- 
bration curve for that sensor. Once this curve is available, the design 
procedure as described above starts over again, but this time with 
compensation of linearity errors included. There is no need to reconsider 
the linearized Kalman filter because we have already seen that its linear- 
ization errors are too severe. 
Figure 9.17 Calibration curve for the second sensor

Figure 9.18 Results from particle filtering after applying a linearity correction


Appendix A


Topics Selected from 
Functional Analysis 


This appendix summarizes some concepts from functional analysis. The 
concepts are part of the mathematical background required for under- 
standing this book. Mathematical peculiarities not relevant in this 
context are omitted. Instead, at the end of the appendix references to 
more detailed treatments are given. 


A.1 LINEAR SPACES 


A linear space (or vector space) over a field F is a set R with elements 
(vectors) f,g,h,... equipped with two operations: 


• addition (f + g): $R \times R \to R$
• scalar multiplication ($\alpha f$ with $\alpha \in F$): $F \times R \to R$


Usually, the field F is the set R of real numbers, or the set C of complex 
numbers. The addition and the multiplication operation must satisfy the 
following axioms: 





Classification, Parameter Estimation and State Estimation: An Engineering Approach using MATLAB 
F. van der Heijden, R.P.W. Duin, D. de Ridder and D.M.J. Tax 
© 2004 John Wiley & Sons, Ltd ISBN: 0-470-09013-8 


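(a) $f + g = g + f$
(b) $(f + g) + h = f + (g + h)$
(c) a so-called zero element $0 \in R$ exists such that $f + 0 = f$
(d) a negative element $-f$ exists for each f such that $f + (-f) = 0$
(e) $\alpha(f + g) = \alpha f + \alpha g$
(f) $(\alpha + \beta)f = \alpha f + \beta f$
(g) $(\alpha\beta)f = \alpha(\beta f)$
(h) $1f = f$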


A linear subspace S of a linear space R is a subset of R which itself is
linear. A condition sufficient and necessary for a subset $S \subset R$ to be
linear is that $\alpha f + \beta g \in S$ for all $f, g \in S$ and for all $\alpha, \beta \in F$.


Examples

C[a,b] is the set of all complex functions f(x) continuous in the
interval [a,b]. With the usual definition of addition and scalar
multiplication this set is a linear space.¹ The set of polynomials of degree N:

$$ f(x) = c_0 + c_1 x + c_2 x^2 + \cdots + c_N x^N \quad \text{with } c_n \in \mathbb{C} $$

is a linear subspace of C[a,b].

The set $\mathbb{R}^\infty$ consisting of an infinite, countable series of real
numbers $f = (f_0, f_1, \ldots)$ is a linear space provided that the addition and
multiplication take place element by element. The subset of $\mathbb{R}^\infty$ that
satisfies the convergence criterion:

$$ \sum_{n=0}^{\infty} |f_n|^2 < \infty $$

is a linear subspace of $\mathbb{R}^\infty$.

The set $\mathbb{R}^N$ consisting of N real numbers $f = (f_0, f_1, \ldots, f_{N-1})$ is a
linear space provided that the addition and multiplication take place
element by element. Any linear hyperplane containing the null vector
(zero element) is a linear subspace.


¹ Throughout this appendix the examples relate to vectors which are either real or complex.
However, these examples can be converted easily from real to complex or vice versa. The set of
all real functions continuous in the interval [a,b] is denoted by R[a,b]. The set of infinite and
finite countable complex numbers is denoted by $\mathbb{C}^\infty$ and $\mathbb{C}^N$, respectively.

Any vector that can be written as:

$$ f = \sum_{i=0}^{n-1} \alpha_i f_i \quad \text{with } \alpha_i \in F $$    (a.1)

is called a linear combination of $f_0, f_1, \ldots, f_{n-1}$. The vectors $f_0, f_1, \ldots, f_{m-1}$
are linear dependent if a set of numbers $\beta_i$ exists, not all zero, for which
the following equation holds:

$$ \sum_{i=0}^{m-1} \beta_i f_i = 0 $$    (a.2)

If no such set exists, the vectors $f_0, f_1, \ldots, f_{m-1}$ are said to be linear
independent.

The dimension of a linear space R is defined as the non-negative integer
number N for which N linear independent vectors in R exist, while any set
of N + 1 vectors in R is linear dependent. If for each number N there exist
N linear independent vectors in R, the dimension of R is $\infty$.


Example
C[a,b] and $\mathbb{R}^\infty$ have dimension $\infty$. $\mathbb{R}^N$ has dimension N.
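
In $\mathbb{R}^N$, linear dependence as in (a.2) can be checked numerically. The following sketch uses three assumed example vectors, the third of which is constructed as a linear combination of the first two; rank() and null() are standard MATLAB functions.

```matlab
f0 = [1; 0; 1];
f1 = [0; 1; 1];
f2 = f0 + 2*f1;            % linear combination of f0 and f1, cf. (a.1)

F = [f0 f1 f2];            % vectors arranged as columns
r = rank(F);               % number of linear independent columns
beta = null(F);            % coefficients with F*beta = 0, cf. (a.2)

dependent = (r < size(F, 2));   % true: the three vectors are dependent
```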


A.1.1 Normed linear spaces 


A norm $\|f\|$ of a linear space is a mapping $R \to \mathbb{R}$ (i.e. a real function or a
functional) such that:

(a) $\|f\| \ge 0$, where $\|f\| = 0$ if and only if $f = 0$
(b) $\|\alpha f\| = |\alpha|\,\|f\|$
(c) $\|f + g\| \le \|f\| + \|g\|$

A linear space equipped with a norm is called a normed linear space.


Examples

The following real functions satisfy the axioms of a norm:
In C[a,b]:

$$ \|f(x)\|_p = \left( \int_a^b |f(x)|^p\,dx \right)^{1/p} \quad \text{with: } p \ge 1 $$    (a.3)

In $\mathbb{R}^\infty$:

$$ \|f\|_p = \left( \sum_{n=0}^{\infty} |f_n|^p \right)^{1/p} \quad \text{with: } p \ge 1 $$    (a.4)

In $\mathbb{R}^N$:

$$ \|f\|_p = \left( \sum_{n=0}^{N-1} |f_n|^p \right)^{1/p} \quad \text{with: } p \ge 1 $$    (a.5)

These norms are called the $L_p$ norm (continuous case, e.g. C[a,b]) and the
$l_p$ norm (discrete case, e.g. $\mathbb{C}^\infty$). Graphical representations of the norm are
depicted in Figure A.1. Special cases occur for particular choices of the
parameter p. If p = 1, the norm is the sum of the absolute magnitudes, e.g.

$$ \|f(x)\|_1 = \int_a^b |f(x)|\,dx \quad \text{and} \quad \|f\|_1 = \sum_n |f_n| $$    (a.6)

If p = 2, we have the Euclidean norm:

$$ \|f(x)\|_2 = \sqrt{\int_a^b |f(x)|^2\,dx} \quad \text{and} \quad \|f\|_2 = \sqrt{\sum_n |f_n|^2} $$    (a.7)

In $\mathbb{R}^2$ and $\mathbb{R}^3$ the norm $\|f\|_2$ is the length of the vector as defined in
geometry. If $p \to \infty$, the $L_p$ and $l_p$ norms are the maxima of the absolute
magnitudes, e.g.

$$ \|f(x)\|_\infty = \max_{x \in [a,b]} |f(x)| \quad \text{and} \quad \|f\|_\infty = \max_n |f_n| $$    (a.8)

Figure A.1 ‘Circles’ in $\mathbb{R}^2$ equipped with the $l_p$ norm

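
The discrete norms (a.5) to (a.8) are available in MATLAB through norm(). The vector below is an assumed example; the last line evaluates (a.5) directly for comparison.

```matlab
f = [3 -4 0 1];             % assumed example vector in R^4

n1   = norm(f, 1);          % p = 1: sum of absolute values, 8
n2   = norm(f, 2);          % p = 2: Euclidean norm, sqrt(26)
ninf = norm(f, Inf);        % p -> infinity: maximum absolute value, 4

p  = 3;
np = sum(abs(f).^p)^(1/p);  % l_p norm written out as in (a.5)
```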
A.1.2 Euclidean spaces or inner product spaces


A Euclidean space (also called inner product space) R is a linear space
for which the inner product is defined. The inner product (f, g) between
two vectors $f, g \in R$ over a field F is a mapping $R \times R \to F$ that satisfies
the following axioms:

(a) $(f + g, h) = (f, h) + (g, h)$
(b) $(\alpha f, g) = \alpha (f, g)$
(c) $(g, f) = \overline{(f, g)}$
(d) $(f, f) \ge 0$, real
(e) $(f, f) = 0 \Rightarrow f = 0$

In (c) the number $\overline{(f, g)}$ is the complex conjugate of (f, g). Of course, this
makes sense only if F is complex. If F is the set of real numbers, then
(g, f) = (f, g).


Examples
C[a,b] is a Euclidean space if the inner product is defined as:

$$ (f(x), g(x)) = \int_a^b f(x)\,\overline{g(x)}\,dx $$

$\mathbb{R}^\infty$ is a Euclidean space if with $f = (f_0, f_1, \ldots)$ and $g = (g_0, g_1, \ldots)$:

$$ (f, g) = \sum_{n=0}^{\infty} f_n g_n $$

$\mathbb{R}^N$ is a Euclidean space if with $f = (f_0, f_1, \ldots, f_{N-1})$ and
$g = (g_0, g_1, \ldots, g_{N-1})$:

$$ (f, g) = \sum_{n=0}^{N-1} f_n g_n $$

In accordance with (a.7), any Euclidean space becomes a normed linear
space as soon as it is equipped with the Euclidean norm:

$$ \|f\| = +\sqrt{(f, f)} $$    (a.9)

Given this norm, any two vectors f and g satisfy the Schwarz inequality:

$$ |(f, g)| \le \|f\|\,\|g\| $$    (a.10)

Two vectors f and g are said to be orthogonal (notation: $f \perp g$) whenever
(f, g) = 0. If two vectors f and g are orthogonal then (Pythagoras):

$$ \|f + g\|^2 = \|f\|^2 + \|g\|^2 $$    (a.11)

Given a vector $g \ne 0$, an arbitrary vector f can be decomposed into a
component that coincides with g and an orthogonal component:
$f = \alpha g + h$ with $g \perp h$. The scalar $\alpha$ follows from:

$$ \alpha = \frac{(f, g)}{(g, g)} = \frac{(f, g)}{\|g\|^2} $$    (a.12)

The term $\alpha g$ is called the projection of f on g. The angle $\varphi$ between two
vectors f and g is defined such that it satisfies:

$$ \cos(\varphi) = \frac{(f, g)}{\|f\|\,\|g\|} $$    (a.13)
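
For real vectors in $\mathbb{R}^N$ the inner product (f, g) equals f'*g in MATLAB, so the decomposition (a.12) and the angle (a.13) can be sketched as follows. The two vectors are assumed examples.

```matlab
f = [2; 1];  g = [3; 0];          % assumed example vectors in R^2

alpha = (f' * g) / (g' * g);      % scalar of (a.12): 6/9
proj  = alpha * g;                % projection of f on g: [2; 0]
h     = f - proj;                 % orthogonal component: [0; 1]

phi = acos((f' * g) / (norm(f) * norm(g)));   % angle of (a.13), in radians

% g and h are orthogonal, so Pythagoras (a.11) applies to proj and h:
lhs = norm(f)^2;                  % ||proj + h||^2
rhs = norm(proj)^2 + norm(h)^2;
```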


A.2 METRIC SPACES 


A metric space R is a set equipped with a distance measure $\rho(f, g)$
that maps any couple of elements $f, g \in R$ into a non-negative real
number: $R \times R \to \mathbb{R}^+$. The distance measure must satisfy the following
axioms:

(a) $\rho(f, g) = 0$ if and only if $f = g$
(b) $\rho(f, g) = \rho(g, f)$
(c) $\rho(f, h) \le \rho(f, g) + \rho(g, h)$

All normed linear spaces become a metric space, if we set:

$$ \rho(f, g) = \|f - g\| $$    (a.14)

Consequently, the following mappings satisfy the axioms of a distance
measure. In C[a,b]:

$$ \rho(f(x), g(x)) = \left( \int_a^b |f(x) - g(x)|^p\,dx \right)^{1/p} \quad \text{with: } p \ge 1 $$    (a.15)

In $\mathbb{R}^\infty$:

$$ \rho(f, g) = \left( \sum_{n=0}^{\infty} |f_n - g_n|^p \right)^{1/p} \quad \text{with: } p \ge 1 $$    (a.16)

In $\mathbb{R}^N$:

$$ \rho(f, g) = \left( \sum_{n=0}^{N-1} |f_n - g_n|^p \right)^{1/p} \quad \text{with: } p \ge 1 $$    (a.17)

These are the Minkowski distances. A theorem related to these measures
is Minkowski’s inequality. In $\mathbb{R}^N$ and $\mathbb{C}^\infty$ this inequality states that:²

$$ \left( \sum_n |f_n + g_n|^p \right)^{1/p} \le \left( \sum_n |f_n|^p \right)^{1/p} + \left( \sum_n |g_n|^p \right)^{1/p} $$    (a.18)

Special cases occur for particular choices of the parameter p. If p = 1,
the distance measure equals the sum of absolute differences between the
various coefficients, e.g.

$$ \rho(f(x), g(x)) = \int_a^b |f(x) - g(x)|\,dx \quad \text{and} \quad \rho(f, g) = \sum_n |f_n - g_n| $$    (a.19)

This measure is called the city-block distance (also known as Manhattan,
magnitude, box-car or absolute value distance). If p = 2, we have the
ordinary Euclidean distance measure. If $p \to \infty$ the maximum of the
absolute differences between the various coefficients is determined, e.g.

$$ \rho(f(x), g(x)) = \max_{x \in [a,b]} |f(x) - g(x)| \quad \text{and} \quad \rho(f, g) = \max_n |f_n - g_n| $$    (a.20)

This measure is the chessboard distance (also known as Chebyshev or
maximum value distance).

The quadratic measure is defined as:

$$ \rho(f, g) = +\sqrt{(f - g, Af - Ag)} $$    (a.21)

where A is a self-adjoint operator with non-negative eigenvalues. This
topic will be discussed in Section A.4 and Section B.5. This measure finds
its application in, for instance, pattern classification (where it is called
the Mahalanobis distance).

Another distance measure is:

$$ \rho(f, g) = \begin{cases} 0 & \text{if } f = g \\ 1 & \text{if } f \ne g \end{cases} $$    (a.22)

An application of this measure is in Bayesian estimation and classification
theory where it is used to express a particular cost function. Note
that in contrast with the preceding examples this measure cannot be
derived from a norm. For every metric derived from a norm we have
$\rho(\alpha f, 0) = |\alpha|\,\rho(f, 0)$. However, this equality does not hold for (a.22).


² In C[a,b] the inequality is similar.
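
The special cases of the Minkowski distance and the measure (a.22) can be compared on a small numerical example. The two vectors are assumptions chosen so that the results are easy to verify by hand.

```matlab
f = [1 5 2];  g = [4 1 2];        % assumed example vectors in R^3
d = f - g;                        % coefficient differences: [-3 4 0]

d_city  = sum(abs(d));            % p = 1, city-block distance: 7
d_eucl  = sqrt(sum(d.^2));        % p = 2, Euclidean distance: 5
d_chess = max(abs(d));            % p -> infinity, chessboard distance: 4
d_01    = double(any(d ~= 0));    % measure (a.22): 1 because f differs from g

% (a.22) is not derived from a norm: scaling f by a factor a does not
% scale its distance to the null vector.
a = 3;
d_f0     = double(any(f ~= 0));     % rho(f, 0) = 1
d_scaled = double(any(a*f ~= 0));   % rho(a*f, 0) is still 1, not |a|
```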


A.3 ORTHONORMAL SYSTEMS AND FOURIER 
SERIES 


In a Euclidean space R with the norm given by (a.9), a subset $S \subset R$ is an
orthogonal system if every couple $b_i, b_j$ in S is orthogonal; i.e. $(b_i, b_j) = 0$
whenever $i \ne j$. If in addition each vector in S has unit length, i.e.
$\|b_i\| = 1$, then S is called an orthonormal system.


Examples
In C[a,b] the following harmonic functions form an orthonormal
system:

$$ w_n(x) = \frac{1}{\sqrt{b-a}} \exp\left( \frac{2\pi j n x}{b-a} \right) \quad \text{with: } j = \sqrt{-1} $$    (a.23)

In $\mathbb{C}^N$ the following vectors are an orthonormal system:

$$ w_n = \left[ \frac{1}{\sqrt{N}} \;\; \frac{1}{\sqrt{N}} \exp\left( \frac{2\pi j n}{N} \right) \;\cdots\; \frac{1}{\sqrt{N}} \exp\left( \frac{2\pi j (N-1) n}{N} \right) \right]^T $$    (a.24)


Let $S = \{w_0 \; w_1 \ldots w_{N-1}\}$ be an orthonormal system in a Euclidean
space R, and let f be an arbitrary vector in R. Then the Fourier
coefficients of f with respect to S are the inner products:

$$ \phi_k = (f, w_k) \qquad k = 0, 1, \ldots, N-1 $$    (a.25)

Furthermore, the series:

$$ \sum_{k=0}^{N-1} \phi_k w_k $$    (a.26)

is called the Fourier series of f with respect to the system S. Suppose we
wish to approximate f by a suitably chosen linear combination of the
system S. The best approximation (according to the norm in R) is given
by (a.26). This follows from the following inequality:

$$ \left\| f - \sum_{k=0}^{N-1} \phi_k w_k \right\| \le \left\| f - \sum_{k=0}^{N-1} \beta_k w_k \right\| \quad \text{for arbitrary } \beta_k $$    (a.27)

The approximation improves as the number N of vectors increases. This
follows readily from Bessel’s inequality:

$$ \sum_{k=0}^{N-1} |\phi_k|^2 \le \|f\|^2 $$    (a.28)

Let $S = \{w_0 \; w_1 \ldots\}$ be an orthonormal system in a Euclidean space R.
The number of vectors in S may be infinite. Suppose that no vector w
exists for which the system S augmented by w, i.e. $\{w \; w_0 \; w_1 \ldots\}$, is
also an orthonormal system. Then, S is called an orthonormal basis. In
that case, the smallest linear subspace containing S is the whole space R.

The number of vectors in an orthonormal basis S may be finite (as in
$\mathbb{R}^N$ and $\mathbb{C}^N$), countable infinite (as in $\mathbb{R}^\infty$, $\mathbb{C}^\infty$, R[a,b], and C[a,b] with
$-\infty < a < b < \infty$), or uncountable infinite (as in $R[-\infty,\infty]$ and
$C[-\infty,\infty]$).

Examples
In C[a,b] with $-\infty < a < b < \infty$ an orthonormal basis is given by the
harmonic functions in (a.23). The number of such functions is countable
infinite. The Fourier series defined in (a.26) essentially corresponds
to the Fourier series expansion of a periodic signal. In
$C[-\infty,\infty]$ this expansion evolves into the Fourier integral.³
In $\mathbb{C}^N$ an orthonormal basis is given by the vectors in (a.24). The
Fourier series in (a.26) is equivalent to the discrete Fourier transform.
The examples given above are certainly not the only orthonormal
bases. In fact, even in $\mathbb{R}^N$ (with N > 1) infinitely many orthonormal
bases exist.


If S is an orthonormal basis, and f an arbitrary vector with Fourier
coefficients $\phi_k$ with respect to S, then the following theorems hold:

$$ f = \sum_k \phi_k w_k $$    (a.29)

$$ (f, g) = \sum_k (f, w_k)\,\overline{(g, w_k)} $$    (Parseval) (a.30)

$$ \|f\|^2 = \sum_k |\phi_k|^2 $$    (Parseval/Pythagoras) (a.31)

Equation (a.29) corresponds to the inverse transform in Fourier analysis.
Equation (a.31) follows directly from (a.30), since $\|w_k\| = 1, \forall k$. The
equation shows that in the case of an orthonormal basis Bessel’s
inequality transforms to an equality.
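
The theorems (a.29) and (a.31) can be checked numerically for the basis (a.24). The sketch below assumes the complex inner product $(f, g) = \sum_n f_n \overline{g_n}$, so that $(f, w_k)$ equals w_k'*f in MATLAB; the dimension N = 8 and the random test vector are arbitrary choices.

```matlab
N = 8;                                  % assumed dimension
n = 0:N-1;
W = exp(2*pi*1j*(n')*n/N) / sqrt(N);    % column k+1 holds the vector w_k of (a.24)

f    = randn(N, 1) + 1j*randn(N, 1);    % arbitrary vector in C^N
phi  = W' * f;                          % Fourier coefficients (a.25)
frec = W * phi;                         % Fourier series (a.29)

orth_err     = norm(W'*W - eye(N));               % orthonormality of the system
recon_err    = norm(frec - f);                    % (a.29): f is recovered
parseval_err = abs(norm(f)^2 - sum(abs(phi).^2)); % (a.31)
```

All three error measures should be zero up to rounding errors.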


A.4 LINEAR OPERATORS 


Given two normed linear spaces $R_1$ and $R_2$, a mapping A of a subset of
$R_1$ into $R_2$ is called an operator from $R_1$ to $R_2$. The subset of $R_1$
(possibly $R_1$ itself) for which the operator A is defined is called the
domain $D_A$ of A. The range $R_A$ is the set $\{g \mid g = Af, f \in D_A\}$. In the
sequel we will assume that $D_A$ is a linear (sub)space. An operator is
linear if for all vectors $f, g \in D_A$ and for all $\alpha$ and $\beta$:


³ In fact, $C[-\infty,\infty]$ must satisfy some conditions in order to assure the existence of the Fourier
expansion.
$$ A(\alpha f + \beta g) = \alpha Af + \beta Ag $$    (a.32)


Examples

Any orthogonal transform is a linear operation. In $\mathbb{R}^N$ and $\mathbb{C}^N$
any matrix-vector multiplication is a linear operation. In $\mathbb{R}^\infty$ and
$\mathbb{C}^\infty$ left-sided shifts, right-sided shifts and any linear combination of
them (i.e. discrete convolution) are linear operations. In R[a,b]
convolution integrals and differential operators are linear operators.


Some special linear operators are:

• The null operator 0 assigns the null vector to each vector: 0f = 0.
• The identity operator I carries each vector into itself: If = f.

An operator A is invertible if for each $g \in R_A$ the equation g = Af has a
unique solution $f \in D_A$. The operator $A^{-1}$ that uniquely assigns this
solution f to g is called the inverse operator of A:

$$ g = Af \Leftrightarrow f = A^{-1}g $$    (a.33)

The following properties are shown easily:

• $A^{-1}A = I$
• $AA^{-1} = I$
• The inverse of a linear operator (if it exists) is linear.

Suppose that in a linear space R two orthonormal bases
$S_a = \{a_0 \; a_1 \cdots\}$ and $S_b = \{b_0 \; b_1 \cdots\}$ are defined. According to
(a.29) each vector $f \in R$ has two representations:

$$ f = \sum_k \alpha_k a_k \quad \text{with: } \alpha_k = (f, a_k) $$

$$ f = \sum_k \beta_k b_k \quad \text{with: } \beta_k = (f, b_k) $$

Since both Fourier series represent the same vector we conclude that:

$$ f = \sum_k \alpha_k a_k = \sum_k \beta_k b_k $$

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The relationship between the Fourier coefficients a, and 3, can be made 
explicit by the calculation of the inner product: 


=D ol (ag, b = Aa bz, bn (a.34) 


The Fourier coefficients a, and 6p can be arranged as vectors 
æ = (a0, a1,- ) and B = (Bo, 61,- ) in Rẹ or CN (if the dimension of 
R is finite), or in R” and C®™ (if the dimension of R is infinite). In one of 
these spaces equation (a.34) defines a linear operator U: 


B=Ua (a.35) 


The inner product in (a.34) could equally well be accomplished with 
respect to a vector a,. This reveals that an operator U* exists for which: 


œ = UB (a.36) 
Clearly, from (a.33): 
U= U (a.37) 


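Suppose we have two vectors $\mathbf{f}_1$ and $\mathbf{f}_2$ represented in $S_a$ by $\boldsymbol{\alpha}_1$ and $\boldsymbol{\alpha}_2$
and in $S_b$ by $\boldsymbol{\beta}_1$ and $\boldsymbol{\beta}_2$. Since the inner product $(\mathbf{f}_1, \mathbf{f}_2)$ must be independent
of the representation, we conclude that $(\mathbf{f}_1, \mathbf{f}_2) = (\boldsymbol{\alpha}_1, \boldsymbol{\alpha}_2) = (\boldsymbol{\beta}_1, \boldsymbol{\beta}_2)$.
Therefore:

$$(\boldsymbol{\alpha}_1, \mathbf{U}^{-1}\boldsymbol{\beta}_2) = (\mathbf{U}\boldsymbol{\alpha}_1, \boldsymbol{\beta}_2) \tag{a.38}$$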


Each operator that satisfies (a.38) is called a unitary operator. A corollary 
of (a.38) is that any unitary operator preserves the Euclidean norm. 
The adjoint A* of an operator A is an operator that satisfies: 


(Af, g) = (f, A*g) (a.39) 


From this definition, and from (a.38), it follows that an operator $\mathbf{U}$ for
which its adjoint $\mathbf{U}^*$ equals its inverse $\mathbf{U}^{-1}$ is a unitary operator. This is
in accordance with the notation used in (a.37). An operator $\mathbf{A}$ is called
self-adjoint if $\mathbf{A}^* = \mathbf{A}$.

Suppose that $\mathbf{A}$ is a linear operator in a space $R$. A vector $\mathbf{e}_k$ that
satisfies:

$$\mathbf{A}\mathbf{e}_k = \lambda_k \mathbf{e}_k, \qquad \mathbf{e}_k \neq \mathbf{0} \tag{a.40}$$

with $\lambda_k$ a real or complex number is called an eigenvector of $\mathbf{A}$. The
number $\lambda_k$ is the eigenvalue. The eigenvectors and eigenvalues of an
operator are found by solving the equation $(\mathbf{A} - \lambda_k\mathbf{I})\mathbf{e}_k = \mathbf{0}$. Note that if
$\mathbf{e}_k$ is a solution of this equation, then so is $\alpha\mathbf{e}_k$ with $\alpha$ any non-zero real or complex
number. If a unique solution is required, we should constrain the length
of the eigenvector to unity, i.e. $\mathbf{v}_k = \mathbf{e}_k/\|\mathbf{e}_k\|$, yielding the so-called normalized
eigenvector. However, since $+\mathbf{e}_k/\|\mathbf{e}_k\|$ and $-\mathbf{e}_k/\|\mathbf{e}_k\|$ are both
valid eigenvectors, we still have to select one out of the two possible
solutions. From now on, the phrase ‘the normalized eigenvector’ will
denote both solutions.

Operators that are self-adjoint have — under mild conditions — some 
nice properties related to their eigenvectors and eigenvalues. The proper- 
ties relevant in our case are: 


1. All eigenvalues are real. 

2. With each eigenvalue at least one normalized eigenvector is asso- 
ciated. However, an eigenvalue can also have multiple normalized 
eigenvectors. These eigenvectors span a linear subspace. 

3. There is an orthonormal basis $V = \{\mathbf{v}_0\ \mathbf{v}_1\ \cdots\}$ formed by the
normalized eigenvectors. Because an eigenvalue may have multiple
normalized eigenvectors (see above), this basis may not be unique.


A corollary of the properties is that any vector $\mathbf{f} \in R$ can be represented
by a Fourier series with respect to $V$, and that in this representation the
operation becomes simply a linear combination, that is:

$$\mathbf{f} = \sum_k \phi_k \mathbf{v}_k \quad\text{with:}\quad \phi_k = (\mathbf{f}, \mathbf{v}_k) \tag{a.41}$$

$$\mathbf{A}\mathbf{f} = \sum_k \lambda_k \phi_k \mathbf{v}_k \tag{a.42}$$

The meaning of this decomposition of the operation is depicted in
Figure A.2. The set of eigenvalues is called the spectrum of the operator.

Figure A.2 Eigenvalue decomposition of a self-adjoint operator
Appendix B


Topics Selected from Linear 
Algebra and Matrix Theory 


Whereas Appendix A deals with general linear spaces and linear operators,
the current appendix restricts the attention to linear spaces with
finite dimension, i.e. $\mathbb{R}^N$ and $\mathbb{C}^N$. With that, all that has been said in
Appendix A also holds true for the topics of this appendix.


B.1 VECTORS AND MATRICES 


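Vectors in $\mathbb{R}^N$ and $\mathbb{C}^N$ are denoted by bold-faced letters, e.g. $\mathbf{f}$, $\mathbf{g}$. The
elements in a vector are arranged either vertically (a column vector) or
horizontally (a row vector). For example:

$$\mathbf{f} = \begin{bmatrix} f_0 \\ f_1 \\ \vdots \\ f_{N-1} \end{bmatrix} \quad\text{or:}\quad \mathbf{f}^T = \begin{bmatrix} f_0 & f_1 & \cdots & f_{N-1} \end{bmatrix} \tag{b.1}$$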

The superscript T is used to convert column vectors to row vectors. 

Vector addition and scalar multiplication are defined as in Section A.1. 
A matrix $\mathbf{H}$ with dimension $N \times M$ is an arrangement of $NM$ numbers
$h_{n,m}$ (the elements) on an orthogonal grid of $N$ rows and $M$ columns:

$$\mathbf{H} = \begin{bmatrix} h_{0,0} & h_{0,1} & \cdots & h_{0,M-1} \\ h_{1,0} & h_{1,1} & \cdots & h_{1,M-1} \\ \vdots & & & \vdots \\ h_{N-1,0} & \cdots & \cdots & h_{N-1,M-1} \end{bmatrix} \tag{b.2}$$

Classification, Parameter Estimation and State Estimation: An Engineering Approach using MATLAB
F. van der Heijden, R.P.W. Duin, D. de Ridder and D.M.J. Tax
© 2004 John Wiley & Sons, Ltd ISBN: 0-470-09013-8


The elements are real or complex. Vectors can be regarded as $N \times 1$
matrices (column vectors) or $1 \times M$ matrices (row vectors). A matrix can
be regarded as a horizontal arrangement of $M$ column vectors with
dimension $N$, for example:

$$\mathbf{H} = \begin{bmatrix} \mathbf{h}_0 & \mathbf{h}_1 & \cdots & \mathbf{h}_{M-1} \end{bmatrix} \tag{b.3}$$


Of course, a matrix can also be regarded as a vertical arrangement of N 
row vectors. 

The scalar-matrix multiplication $\alpha\mathbf{H}$ replaces each element in $\mathbf{H}$ with
$\alpha h_{n,m}$. The matrix addition $\mathbf{H} = \mathbf{A} + \mathbf{B}$ is only defined if the two
matrices $\mathbf{A}$ and $\mathbf{B}$ have equal size $N \times M$. The result $\mathbf{H}$ is an $N \times M$
matrix with elements $h_{n,m} = a_{n,m} + b_{n,m}$. These two operations satisfy
the axioms of a linear space (Section A.1). Therefore, the set of all
$N \times M$ matrices is another example of a linear space.

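The matrix-matrix product $\mathbf{H} = \mathbf{A}\mathbf{B}$ is defined only when the number
of columns of $\mathbf{A}$ equals the number of rows of $\mathbf{B}$. Suppose that $\mathbf{A}$ is an
$N \times P$ matrix, and that $\mathbf{B}$ is a $P \times M$ matrix, then the product $\mathbf{H} = \mathbf{A}\mathbf{B}$ is
an $N \times M$ matrix with elements:

$$h_{n,m} = \sum_{p=0}^{P-1} a_{n,p}\, b_{p,m} \tag{b.4}$$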


Since a vector can be regarded as an $N \times 1$ matrix, this also defines the
matrix-vector product $\mathbf{g} = \mathbf{H}\mathbf{f}$ with $\mathbf{f}$ an $M$-dimensional column vector,
$\mathbf{H}$ an $N \times M$ matrix and $\mathbf{g}$ an $N$-dimensional column vector. In accordance
with these definitions, the inner product between two real
$N$-dimensional vectors introduced in Section A.1.2 can be written as:

$$(\mathbf{f}, \mathbf{g}) = \mathbf{f}^T\mathbf{g} = \sum_{n=0}^{N-1} f_n g_n \tag{b.5}$$

It is easy to show that a matrix-vector product $\mathbf{g} = \mathbf{H}\mathbf{f}$ defines a linear
operator from $\mathbb{R}^M$ into $\mathbb{R}^N$ and from $\mathbb{C}^M$ into $\mathbb{C}^N$. Therefore, all definitions
and properties related to linear operators (Section A.4) also apply to
matrices.
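
The element-wise definition (b.4) can be sketched in a few lines of Python. This is only an illustration: the function names `matmul` and `matvec` and the example numbers are assumptions introduced here, not part of the text. The last lines check the linearity of $\mathbf{g} = \mathbf{H}\mathbf{f}$ for one pair of vectors.

```python
def matmul(a, b):
    """Matrix product following (b.4): h[n][m] = sum over p of a[n][p] * b[p][m]."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("number of columns of A must equal number of rows of B")
    return [[sum(a[n][p] * b[p][m] for p in range(inner))
             for m in range(len(b[0]))]
            for n in range(len(a))]


def matvec(h, f):
    """Matrix-vector product g = Hf, treating f as an M x 1 matrix."""
    return [row[0] for row in matmul(h, [[x] for x in f])]


# Hypothetical example data (not taken from the text).
H = [[1, 2, 0],
     [0, 1, 3]]            # N = 2 rows, M = 3 columns
f = [1, 0, 2]
g = [4, 1, 1]

# Linearity: H(2f + 3g) should equal 2(Hf) + 3(Hg).
lhs = matvec(H, [2 * x + 3 * y for x, y in zip(f, g)])
rhs = [2 * x + 3 * y for x, y in zip(matvec(H, f), matvec(H, g))]
```

Both `lhs` and `rhs` are computed independently, so comparing them is a small sanity check of the linear-operator property rather than a proof.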

Some special matrices are: 


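• The null matrix $\mathbf{O}$. This is a matrix fully filled with zeros. It corresponds
  to the null operator: $\mathbf{O}\mathbf{f} = \mathbf{0}$.

• The unit matrix $\mathbf{I}$. This matrix is square ($N = M$) and fully filled with
  zeros, except for the diagonal elements, which are unity:

$$\mathbf{I} = \begin{bmatrix} 1 & & 0 \\ & \ddots & \\ 0 & & 1 \end{bmatrix}$$

  This matrix corresponds to the unit operator: $\mathbf{I}\mathbf{f} = \mathbf{f}$.

• A diagonal matrix $\mathbf{\Lambda}$ is a square matrix, fully filled with zeros, except
  for its diagonal elements $\lambda_{n,n}$:

$$\mathbf{\Lambda} = \begin{bmatrix} \lambda_{0,0} & & 0 \\ & \ddots & \\ 0 & & \lambda_{N-1,N-1} \end{bmatrix}$$

  Often, diagonal matrices are denoted by upper case Greek symbols.

• The transposed matrix $\mathbf{H}^T$ of an $N \times M$ matrix $\mathbf{H}$ is an $M \times N$
  matrix; its elements are given by $h^T_{m,n} = h_{n,m}$.

• A symmetric matrix is a square matrix for which $\mathbf{H}^T = \mathbf{H}$.

• The conjugate of a matrix $\mathbf{H}$ is a matrix $\bar{\mathbf{H}}$, the elements of which
  are the complex conjugates of those of $\mathbf{H}$.

• The adjoint of a matrix $\mathbf{H}$ is a matrix $\mathbf{H}^*$ which is the conjugate
  and the transpose of $\mathbf{H}$, that is: $\mathbf{H}^* = \bar{\mathbf{H}}^T$. A matrix $\mathbf{H}$ is self-adjoint
  or Hermitian if $\mathbf{H}^* = \mathbf{H}$. This is the case only if $\mathbf{H}$ is square
  and $h_{n,m} = \bar{h}_{m,n}$.

• The inverse of a square matrix $\mathbf{H}$ is the matrix $\mathbf{H}^{-1}$ that satisfies
  $\mathbf{H}^{-1}\mathbf{H} = \mathbf{I}$. If it exists, it is unique. In that case the matrix $\mathbf{H}$ is called
  regular. If $\mathbf{H}^{-1}$ doesn’t exist, $\mathbf{H}$ is called singular.

• A unitary matrix $\mathbf{U}$ is a square matrix that satisfies $\mathbf{U}^{-1} = \mathbf{U}^*$.
  A real unitary matrix is called orthonormal. These matrices satisfy
  $\mathbf{U}^{-1} = \mathbf{U}^T$.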
• A square matrix $\mathbf{H}$ is Toeplitz if its elements satisfy $h_{n,m} = g_{(n-m)}$, in
  which $g_n$ is a sequence of $2N - 1$ numbers.

• A square matrix $\mathbf{H}$ is circulant if its elements satisfy
  $h_{n,m} = g_{(n-m)\%N}$. Here, $(n - m)\%N$ is the (non-negative) remainder of $(n - m)/N$.

• A matrix $\mathbf{H}$ is separable if it can be written as the product of two
  vectors: $\mathbf{H} = \mathbf{f}\mathbf{g}^T$.


Some properties with respect to the matrices mentioned above: 


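$$(\mathbf{H}^*)^* = \mathbf{H} \tag{b.6}$$

$$(\mathbf{A}\mathbf{B})^* = \mathbf{B}^*\mathbf{A}^* \tag{b.7}$$

$$(\mathbf{H}^{-1})^* = (\mathbf{H}^*)^{-1} \tag{b.8}$$

$$(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1} \tag{b.9}$$

$$(\mathbf{A}^{-1} + \mathbf{H}^T\mathbf{B}^{-1}\mathbf{H})^{-1} = \mathbf{A} - \mathbf{A}\mathbf{H}^T(\mathbf{H}\mathbf{A}\mathbf{H}^T + \mathbf{B})^{-1}\mathbf{H}\mathbf{A} \tag{b.10}$$


The relations hold if the sizes of the matrices are compatible and the
inverses exist. Property (b.10) is known as the matrix inversion lemma.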
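
The matrix inversion lemma can be checked numerically on a small case. The sketch below is illustrative only: it uses exact rational arithmetic on hypothetical $2 \times 2$ matrices $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{H}$ chosen here so that all required inverses exist, and the helper names are assumptions, not part of the text.

```python
from fractions import Fraction


def to_fractions(m):
    return [[Fraction(v) for v in row] for row in m]


def matmul(a, b):
    return [[sum(a[n][p] * b[p][m] for p in range(len(b)))
             for m in range(len(b[0]))] for n in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def subtract(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def inverse_2x2(a):
    """Inverse of a regular 2 x 2 matrix via its determinant."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if det == 0:
        raise ValueError("matrix is singular")
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


# Hypothetical example matrices (not taken from the text).
A = to_fractions([[2, 1], [1, 3]])
B = to_fractions([[1, 0], [0, 2]])
H = to_fractions([[1, 2], [0, 1]])
Ht = transpose(H)

# Left-hand side of (b.10): (A^-1 + H^T B^-1 H)^-1
lhs = inverse_2x2(add(inverse_2x2(A), matmul(matmul(Ht, inverse_2x2(B)), H)))

# Right-hand side of (b.10): A - A H^T (H A H^T + B)^-1 H A
gain = matmul(matmul(A, Ht), inverse_2x2(add(matmul(matmul(H, A), Ht), B)))
rhs = subtract(A, matmul(matmul(gain, H), A))
```

Because the arithmetic is exact, `lhs` and `rhs` can be compared with `==` instead of a floating-point tolerance.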


B.2 CONVOLUTION 


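Defined in a finite interval, the discrete convolution between a sequence
$f_k$ and $h_k$:

$$g_n = \sum_{k=0}^{N-1} h_{n-k} f_k \quad\text{with:}\quad n = 0, 1, \ldots, N-1 \tag{b.11}$$

can be written economically as a matrix-vector product $\mathbf{g} = \mathbf{H}\mathbf{f}$. The
matrix $\mathbf{H}$ is a Toeplitz matrix:

$$\mathbf{H} = \begin{bmatrix} h_0 & h_{-1} & h_{-2} & \cdots & h_{1-N} \\ h_1 & h_0 & h_{-1} & \cdots & h_{2-N} \\ \vdots & & \ddots & & \vdots \\ h_{N-1} & h_{N-2} & \cdots & h_1 & h_0 \end{bmatrix} \tag{b.12}$$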




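If this ‘finite interval’ convolution is replaced with a circulant (wrap-around)
discrete convolution, then (b.11) becomes:

$$g_n = \sum_{k=0}^{N-1} h_{(n-k)\%N}\, f_k \quad\text{with:}\quad n = 0, 1, \ldots, N-1 \tag{b.13}$$

In that case, the matrix-vector relation $\mathbf{g} = \mathbf{H}\mathbf{f}$ still holds. However, the
matrix $\mathbf{H}$ is now a circulant matrix:

$$\mathbf{H} = \begin{bmatrix} h_0 & h_{N-1} & h_{N-2} & \cdots & h_1 \\ h_1 & h_0 & h_{N-1} & \cdots & h_2 \\ \vdots & & \ddots & & \vdots \\ h_{N-1} & h_{N-2} & \cdots & h_1 & h_0 \end{bmatrix} \tag{b.14}$$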


The Fourier matrix $\mathbf{W}$ is a unitary $N \times N$ matrix with elements given by:

$$w_{n,m} = \frac{1}{\sqrt{N}} \exp\left(\frac{-2\pi j\, nm}{N}\right) \quad\text{with:}\quad j = \sqrt{-1} \tag{b.15}$$

The (row) vectors in this matrix are the complex conjugates of the
basis vectors given in (a.24). Note that $\mathbf{W}^* = \mathbf{W}^{-1}$ because $\mathbf{W}$ is unitary.
It can be shown that the circulant convolution in (b.13) can be transformed
into an element-by-element multiplication provided that the
vector $\mathbf{g}$ is represented by the orthonormal basis of (a.24). In this
representation the circulant convolution $\mathbf{g} = \mathbf{H}\mathbf{f}$ becomes; see (a.36):

$$\mathbf{W}^*\mathbf{g} = \mathbf{W}^*\mathbf{H}\mathbf{f} \tag{b.16}$$


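Writing $\mathbf{W} = [\mathbf{w}_0\ \mathbf{w}_1\ \cdots\ \mathbf{w}_{N-1}]$ and carefully examining $\mathbf{H}\mathbf{w}_k$
reveals that the basis vectors $\mathbf{w}_k$ are the eigenvectors of the circulant
matrix $\mathbf{H}$. Therefore, we may write $\mathbf{H}\mathbf{w}_k = \lambda_k\mathbf{w}_k$ with
$k = 0, 1, \ldots, N-1$. The numbers $\lambda_k$ are the eigenvalues of $\mathbf{H}$. If these
eigenvalues are arranged in a diagonal matrix:

$$\mathbf{\Lambda} = \begin{bmatrix} \lambda_0 & & 0 \\ & \ddots & \\ 0 & & \lambda_{N-1} \end{bmatrix} \tag{b.17}$$

the $N$ equations $\mathbf{H}\mathbf{w}_k = \lambda_k\mathbf{w}_k$ can be written economically as:
$\mathbf{H}\mathbf{W} = \mathbf{W}\mathbf{\Lambda}$. Right-sided multiplication of this equation by $\mathbf{W}^{-1}$ yields:
$\mathbf{H} = \mathbf{W}\mathbf{\Lambda}\mathbf{W}^{-1}$. Substitution in (b.16) gives:

$$\mathbf{W}^*\mathbf{g} = \mathbf{W}^*\mathbf{W}\mathbf{\Lambda}\mathbf{W}^{-1}\mathbf{f} = \mathbf{\Lambda}\mathbf{W}^{-1}\mathbf{f} \tag{b.18}$$

Note that the multiplication of $\mathbf{\Lambda}$ with the vector $\mathbf{W}^{-1}\mathbf{f}$ is an element-by-element
multiplication because the matrix $\mathbf{\Lambda}$ is diagonal. The final result
is obtained if we perform a left-sided multiplication in (b.18) by $\mathbf{W}$:

$$\mathbf{g} = \mathbf{W}\mathbf{\Lambda}\mathbf{W}^{-1}\mathbf{f} \tag{b.19}$$

The interpretation of this result is depicted in Figure B.1.

Figure B.1 Discrete circulant convolution accomplished in the Fourier domain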
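
The equivalence between circulant convolution and element-by-element multiplication in the Fourier domain can be tried out with a short Python sketch. It is an illustration under stated assumptions: the function names and test sequences are hypothetical, and for brevity it uses the common unnormalized discrete Fourier transform (with the factor $1/N$ in the inverse) rather than the unitary matrix $\mathbf{W}$, so that the transform of $h$ plays the role of the eigenvalues.

```python
import cmath


def circular_convolution(h, f):
    """Direct evaluation of (b.13): g[n] = sum_k h[(n - k) % N] * f[k]."""
    size = len(f)
    return [sum(h[(n - k) % size] * f[k] for k in range(size))
            for n in range(size)]


def dft(x, inverse=False):
    """Unnormalized DFT; the inverse transform divides by N."""
    size = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[n] * cmath.exp(sign * 2j * cmath.pi * k * n / size)
               for n in range(size))
           for k in range(size)]
    return [v / size for v in out] if inverse else out


def circular_convolution_fourier(h, f):
    """Transform both sequences, multiply element by element, transform back."""
    spectrum = [a * b for a, b in zip(dft(h), dft(f))]
    # For real input sequences the imaginary parts are rounding noise only.
    return [v.real for v in dft(spectrum, inverse=True)]


# Hypothetical example sequences (not taken from the text).
h = [1, 2, 0, 0]
f = [1, 0, 3, 0]
g_direct = circular_convolution(h, f)
g_fourier = circular_convolution_fourier(h, f)
```

The direct sum costs on the order of $N^2$ multiplications; the Fourier route becomes attractive only when a fast transform replaces the plain `dft` used here.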


B.3 TRACE AND DETERMINANT 


The trace $\mathrm{trace}(\mathbf{H})$ of a square matrix $\mathbf{H}$ is the sum of its diagonal elements:

$$\mathrm{trace}(\mathbf{H}) = \sum_{n=0}^{N-1} h_{n,n} \tag{b.20}$$


Properties related to the trace are (A and B are N x N matrices, f and g 
are N-dimensional vectors): 


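$$\mathrm{trace}(\mathbf{A}\mathbf{B}) = \mathrm{trace}(\mathbf{B}\mathbf{A}) \tag{b.21}$$

$$(\mathbf{f}, \mathbf{g}) = \mathbf{g}^*\mathbf{f} = \mathrm{trace}(\mathbf{f}\mathbf{g}^*) \tag{b.22}$$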


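The determinant $|\mathbf{H}|$ of a square matrix $\mathbf{H}$ is recursively defined with its
co-matrices. The co-matrix $\mathbf{H}_{n,m}$ is an $(N-1) \times (N-1)$ matrix that is
derived from $\mathbf{H}$ by exclusion of the $n$-th row and the $m$-th column. The
following equations define the determinant:

$$\begin{aligned} &\text{If } N = 1: & |\mathbf{H}| &= h_{0,0} \\ &\text{If } N > 1: & |\mathbf{H}| &= \sum_{m=0}^{N-1} (-1)^m\, h_{0,m}\, |\mathbf{H}_{0,m}| \end{aligned} \tag{b.23}$$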
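Some properties related to the determinant:

$$|\mathbf{A}\mathbf{B}| = |\mathbf{A}||\mathbf{B}| \tag{b.24}$$

$$|\mathbf{A}^{-1}| = \frac{1}{|\mathbf{A}|} \tag{b.25}$$

$$|\mathbf{A}^T| = |\mathbf{A}| \tag{b.26}$$

$$\mathbf{U} \text{ is unitary} \;\Rightarrow\; |\mathbf{U}| \text{ has unit modulus } (|\mathbf{U}| = \pm 1 \text{ if } \mathbf{U} \text{ is real}) \tag{b.27}$$

$$\mathbf{\Lambda} \text{ is diagonal} \;\Rightarrow\; |\mathbf{\Lambda}| = \prod_{n=0}^{N-1} \lambda_{n,n} \tag{b.28}$$

If $\lambda_n$ are the eigenvalues of a square matrix $\mathbf{A}$, then:

$$\mathrm{trace}(\mathbf{A}) = \sum_{n=0}^{N-1} \lambda_n \quad\text{and}\quad |\mathbf{A}| = \prod_{n=0}^{N-1} \lambda_n \tag{b.29}$$
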

The rank of a matrix is the maximum number of column vectors (or row 
vectors) that are linearly independent. The rank of a regular N x N 
matrix is always N. In that case, the determinant is always non-zero. 
The reverse holds true too. The rank of a singular N x N matrix is 
always less than N, and the determinant is zero. 
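
The recursive definition (b.23) translates almost literally into code. The following Python sketch is meant as an illustration, not as an efficient method (the recursion grows factorially with $N$); the function names and the example matrices are assumptions made here.

```python
def determinant(h):
    """Determinant by expansion along row 0, following (b.23)."""
    size = len(h)
    if size == 1:
        return h[0][0]
    total = 0
    for m in range(size):
        # Co-matrix H_{0,m}: exclude row 0 and column m.
        co_matrix = [row[:m] + row[m + 1:] for row in h[1:]]
        total += (-1) ** m * h[0][m] * determinant(co_matrix)
    return total


def trace(h):
    """Sum of the diagonal elements, following (b.20)."""
    return sum(h[n][n] for n in range(len(h)))


def matmul(a, b):
    return [[sum(a[n][p] * b[p][m] for p in range(len(b)))
             for m in range(len(b[0]))] for n in range(len(a))]


# Hypothetical example matrices (not taken from the text).
A = [[2, 0, 1],
     [1, 3, 0],
     [0, 1, 4]]
B = [[1, 2, 0],
     [0, 1, 0],
     [3, 0, 1]]
singular = [[1, 2],
            [2, 4]]          # second row is twice the first: rank 1

det_A = determinant(A)
det_AB = determinant(matmul(A, B))
```

With integer entries the results are exact, so properties such as (b.21) and (b.24) can be confirmed by direct comparison.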


B.4 DIFFERENTIATION OF VECTOR AND MATRIX 
FUNCTIONS 


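Suppose $f(\mathbf{x})$ is a real or complex function of the real $N$-dimensional
vector $\mathbf{x}$. Then, the first derivative of $f(\mathbf{x})$ with respect to $\mathbf{x}$ is an
$N$-dimensional vector function (the gradient):

$$\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} \quad\text{with elements:}\quad \frac{\partial f(\mathbf{x})}{\partial x_n} \tag{b.30}$$

If $f(\mathbf{x}) = \mathbf{a}^T\mathbf{x}$ (i.e. the inner product between $\mathbf{x}$ and a real vector $\mathbf{a}$), then:

$$\frac{\partial [\mathbf{a}^T\mathbf{x}]}{\partial \mathbf{x}} = \mathbf{a} \tag{b.31}$$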
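Likewise, if $f(\mathbf{x}) = \mathbf{x}^T\mathbf{H}\mathbf{x}$ (i.e. a quadratic form defined by the matrix $\mathbf{H}$),
then:

$$\frac{\partial [\mathbf{x}^T\mathbf{H}\mathbf{x}]}{\partial \mathbf{x}} = 2\mathbf{H}\mathbf{x} \tag{b.32}$$

This form holds for a symmetric $\mathbf{H}$; for a general square $\mathbf{H}$ the gradient is $(\mathbf{H} + \mathbf{H}^T)\mathbf{x}$.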
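
The gradient of the quadratic form can be verified element by element. The short derivation below is a sketch added for illustration; it writes the quadratic form as a double sum and differentiates with respect to one element $x_k$. For a symmetric $\mathbf{H}$ it reduces to (b.32).

```latex
\begin{align*}
f(\mathbf{x}) &= \mathbf{x}^T\mathbf{H}\mathbf{x}
   = \sum_{n=0}^{N-1}\sum_{m=0}^{N-1} h_{n,m}\, x_n x_m \\
\frac{\partial f(\mathbf{x})}{\partial x_k}
  &= \sum_{m=0}^{N-1} h_{k,m}\, x_m + \sum_{n=0}^{N-1} h_{n,k}\, x_n
   = [\mathbf{H}\mathbf{x}]_k + [\mathbf{H}^T\mathbf{x}]_k \\
\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}
  &= (\mathbf{H} + \mathbf{H}^T)\,\mathbf{x}
   = 2\mathbf{H}\mathbf{x} \quad \text{if } \mathbf{H}^T = \mathbf{H}
\end{align*}
```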


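The second derivative of $f(\mathbf{x})$ with respect to $\mathbf{x}$ is an $N \times N$ matrix called
the Hessian matrix:

$$\mathbf{H}(\mathbf{x}) = \frac{\partial^2 f(\mathbf{x})}{\partial \mathbf{x}^2} \quad\text{with elements:}\quad h_{n,m}(\mathbf{x}) = \frac{\partial^2 f(\mathbf{x})}{\partial x_n\, \partial x_m} \tag{b.33}$$

The determinant of this matrix is called the Hessian.

The Jacobian matrix of an $N$-dimensional vector function $\mathbf{f}()$:
$\mathbb{R}^M \to \mathbb{R}^N$ is defined as an $N \times M$ matrix:

$$\mathbf{H}(\mathbf{x}) = \frac{\partial \mathbf{f}(\mathbf{x})}{\partial \mathbf{x}} \quad\text{with elements:}\quad h_{n,m}(\mathbf{x}) = \frac{\partial f_n(\mathbf{x})}{\partial x_m} \tag{b.34}$$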


Its determinant (only defined if the Jacobian matrix is square) is called 
the Jacobian. 

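The differentiation of a function of a matrix, e.g. $f(\mathbf{H})$, with respect to this
matrix is defined similarly to the differentiation in (b.30). The result is a matrix:

$$\frac{\partial f(\mathbf{H})}{\partial \mathbf{H}} \quad\text{with elements:}\quad \frac{\partial f(\mathbf{H})}{\partial h_{n,m}} \tag{b.35}$$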


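Suppose that we have a square, invertible $R \times R$ matrix function $\mathbf{F}(\mathbf{H})$ of
an $N \times M$ matrix $\mathbf{H}$, that is $\mathbf{F}()$: $\mathbb{R}^N \times \mathbb{R}^M \to \mathbb{R}^R \times \mathbb{R}^R$, then the derivative
of $\mathbf{F}(\mathbf{H})$ with respect to one of the elements of $\mathbf{H}$ is:

$$\frac{\partial}{\partial h_{n,m}}\mathbf{F}(\mathbf{H}) = \begin{bmatrix} \dfrac{\partial}{\partial h_{n,m}} f_{0,0}(\mathbf{H}) & \cdots & \dfrac{\partial}{\partial h_{n,m}} f_{0,R-1}(\mathbf{H}) \\ \vdots & & \vdots \\ \dfrac{\partial}{\partial h_{n,m}} f_{R-1,0}(\mathbf{H}) & \cdots & \dfrac{\partial}{\partial h_{n,m}} f_{R-1,R-1}(\mathbf{H}) \end{bmatrix} \tag{b.36}$$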
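From this the following rules can be derived:

$$\frac{\partial}{\partial h_{n,m}}[\mathbf{F}(\mathbf{H}) + \mathbf{G}(\mathbf{H})] = \frac{\partial}{\partial h_{n,m}}\mathbf{F}(\mathbf{H}) + \frac{\partial}{\partial h_{n,m}}\mathbf{G}(\mathbf{H}) \tag{b.37}$$

$$\frac{\partial}{\partial h_{n,m}}[\mathbf{F}(\mathbf{H})\mathbf{G}(\mathbf{H})] = \left[\frac{\partial}{\partial h_{n,m}}\mathbf{F}(\mathbf{H})\right]\mathbf{G}(\mathbf{H}) + \mathbf{F}(\mathbf{H})\left[\frac{\partial}{\partial h_{n,m}}\mathbf{G}(\mathbf{H})\right] \tag{b.38}$$

$$\frac{\partial}{\partial h_{n,m}}\mathbf{F}^{-1}(\mathbf{H}) = -\mathbf{F}^{-1}(\mathbf{H})\left[\frac{\partial}{\partial h_{n,m}}\mathbf{F}(\mathbf{H})\right]\mathbf{F}^{-1}(\mathbf{H}) \tag{b.39}$$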


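Suppose that $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$ are square matrices of equal size. Then,
some properties related to the derivatives of the trace and the
determinant are:

$$\frac{\partial}{\partial \mathbf{A}}\mathrm{trace}(\mathbf{A}) = \mathbf{I} \tag{b.40}$$

$$\frac{\partial}{\partial \mathbf{A}}\mathrm{trace}(\mathbf{B}\mathbf{A}\mathbf{C}) = \mathbf{B}^T\mathbf{C}^T \tag{b.41}$$

$$\frac{\partial}{\partial \mathbf{A}}\mathrm{trace}(\mathbf{A}\mathbf{B}\mathbf{A}^T) = \mathbf{A}(\mathbf{B} + \mathbf{B}^T) \tag{b.42}$$

$$\frac{\partial}{\partial \mathbf{A}}|\mathbf{B}\mathbf{A}\mathbf{C}| = |\mathbf{B}\mathbf{A}\mathbf{C}|\,(\mathbf{A}^{-1})^T \tag{b.43}$$

$$\frac{\partial}{\partial a_{n,m}}|\mathbf{A}| = |\mathbf{A}|\,[\mathbf{A}^{-1}]_{m,n} \tag{b.44}$$

In (b.44), $[\mathbf{A}^{-1}]_{m,n}$ is the $m,n$-th element of $\mathbf{A}^{-1}$.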
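
As a small worked check of the determinant derivative, consider a general regular $2 \times 2$ matrix. The example is an illustration added here; it differentiates $|\mathbf{A}|$ with respect to the off-diagonal element $a_{0,1}$ and compares the result with $|\mathbf{A}|$ times the $(1,0)$-th element of the inverse.

```latex
\begin{align*}
\mathbf{A} &= \begin{bmatrix} a_{0,0} & a_{0,1} \\ a_{1,0} & a_{1,1} \end{bmatrix},
\qquad |\mathbf{A}| = a_{0,0}a_{1,1} - a_{0,1}a_{1,0},
\qquad \mathbf{A}^{-1} = \frac{1}{|\mathbf{A}|}
   \begin{bmatrix} a_{1,1} & -a_{0,1} \\ -a_{1,0} & a_{0,0} \end{bmatrix} \\
\frac{\partial |\mathbf{A}|}{\partial a_{0,1}} &= -a_{1,0}
   = |\mathbf{A}| \cdot \frac{-a_{1,0}}{|\mathbf{A}|}
   = |\mathbf{A}|\,[\mathbf{A}^{-1}]_{1,0} \\
\frac{\partial |\mathbf{A}|}{\partial a_{0,0}} &= a_{1,1}
   = |\mathbf{A}|\,[\mathbf{A}^{-1}]_{0,0}
\end{align*}
```

Note the swapped indices: the derivative with respect to $a_{n,m}$ involves the $m,n$-th element of the inverse, which is why the transpose appears in the matrix form.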


B.5 DIAGONALIZATION OF SELF-ADJOINT
MATRICES


Recall from Section B.1 that an $N \times N$ matrix $\mathbf{H}$ is called self-adjoint
or Hermitian if $\mathbf{H}^* = \mathbf{H}$. From the discussion on self-adjoint
operators in Section A.4 it is clear that associated with $\mathbf{H}$, there
exists an orthonormal basis $V = \{\mathbf{v}_0\ \mathbf{v}_1\ \cdots\ \mathbf{v}_{N-1}\}$ which we
now arrange in a unitary matrix $\mathbf{V} = [\mathbf{v}_0\ \mathbf{v}_1\ \cdots\ \mathbf{v}_{N-1}]$. Each
vector $\mathbf{v}_k$ is an eigenvector with corresponding (real) eigenvalue $\lambda_k$.
These eigenvalues are now arranged as the diagonal elements in a
diagonal matrix $\mathbf{\Lambda}$.
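The operation $\mathbf{H}\mathbf{f}$ can be written as; see (a.42) and (b.5):

$$\mathbf{H}\mathbf{f} = \sum_{n=0}^{N-1} \lambda_n (\mathbf{f}, \mathbf{v}_n)\,\mathbf{v}_n = \sum_{n=0}^{N-1} \lambda_n \mathbf{v}_n (\mathbf{v}_n^*\mathbf{f}) = \sum_{n=0}^{N-1} \lambda_n \mathbf{v}_n\mathbf{v}_n^*\,\mathbf{f} \tag{b.45}$$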


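Suppose that the rank of $\mathbf{H}$ equals $R$. Then there are exactly $R$ non-zero
eigenvalues. If the eigenvalues are ordered such that the non-zero ones come
first, the number of terms in (b.45) can be
replaced with $R$. From this, it follows that $\mathbf{H}$ is a composition of its
eigenvectors according to:

$$\mathbf{H} = \sum_{k=0}^{R-1} \lambda_k \mathbf{v}_k\mathbf{v}_k^* \tag{b.46}$$

The summation on the right-hand side can be written more economically
as: $\mathbf{V}\mathbf{\Lambda}\mathbf{V}^*$. Therefore:

$$\mathbf{H} = \mathbf{V}\mathbf{\Lambda}\mathbf{V}^* \tag{b.47}$$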


The unitary matrix V* transforms the domain of H such that H becomes 
the diagonal matrix A in this new domain. The matrix V accomplishes 
the inverse transform. In fact, (b.47) is the matrix version of the decom- 
position shown in Figure A.2. 
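
For a real symmetric $2 \times 2$ matrix the decomposition can be written in closed form, which makes it a convenient test case. The Python sketch below is illustrative only: the function names and the example matrix are assumptions made here, and the closed-form rotation angle applies to the $2 \times 2$ case only.

```python
import math


def eig_sym2(h):
    """Eigenvalues and normalized eigenvectors of a real symmetric 2 x 2 matrix."""
    a, b, c = h[0][0], h[0][1], h[1][1]
    mean = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    eigenvalues = [mean + radius, mean - radius]
    # Rotation angle of the eigenvector that belongs to the largest eigenvalue.
    theta = 0.5 * math.atan2(2 * b, a - c)
    v0 = [math.cos(theta), math.sin(theta)]
    v1 = [-math.sin(theta), math.cos(theta)]
    return eigenvalues, [v0, v1]


def compose(eigenvalues, eigenvectors):
    """Rebuild H as the sum of lambda_k * v_k * v_k^T, cf. (b.46)."""
    return [[sum(lam * v[n] * v[m] for lam, v in zip(eigenvalues, eigenvectors))
             for m in range(2)] for n in range(2)]


# Hypothetical example matrix (real and symmetric, hence self-adjoint).
H = [[2.0, 1.0],
     [1.0, 2.0]]
eigenvalues, eigenvectors = eig_sym2(H)
H_rebuilt = compose(eigenvalues, eigenvectors)
```

For this example the eigenvalues are 3 and 1, with eigenvectors along the two diagonals of the plane, and `H_rebuilt` reproduces `H` up to rounding.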

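If the rank $R$ equals $N$, there are exactly $N$ non-zero eigenvalues. In
that case, the matrix $\mathbf{\Lambda}$ is invertible, and so is $\mathbf{H}$:

$$\mathbf{H}^{-1} = \mathbf{V}\mathbf{\Lambda}^{-1}\mathbf{V}^* = \sum_{n=0}^{N-1} \frac{\mathbf{v}_n\mathbf{v}_n^*}{\lambda_n} \tag{b.48}$$

It can be seen that the inverse $\mathbf{H}^{-1}$ is also self-adjoint.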

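A self-adjoint matrix $\mathbf{H}$ is positive definite if its eigenvalues are all
positive. In that case the expression $\rho(\mathbf{f}, \mathbf{g}) = \sqrt{(\mathbf{f} - \mathbf{g})^*\mathbf{H}(\mathbf{f} - \mathbf{g})}$ satisfies
the conditions of a distance measure (Section A.2). To show this it
suffices to prove that $\sqrt{\mathbf{f}^*\mathbf{H}\mathbf{f}}$ satisfies the conditions of a norm; see
Section A.2. These conditions are given in Section A.1. We use the
diagonal form of $\mathbf{H}$; see (b.47):

$$\mathbf{f}^*\mathbf{H}\mathbf{f} = \mathbf{f}^*\mathbf{V}\mathbf{\Lambda}\mathbf{V}^*\mathbf{f} \tag{b.49}$$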


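Since $\mathbf{V}$ is a unitary matrix, the vector $\mathbf{V}^*\mathbf{f}$ can be regarded as the
representation of $\mathbf{f}$ with respect to the orthonormal basis defined by
the vectors in $\mathbf{V}$. Let this representation be denoted by: $\boldsymbol{\phi} = \mathbf{V}^*\mathbf{f}$. The
expression $\mathbf{f}^*\mathbf{H}\mathbf{f}$ equals:

$$\mathbf{f}^*\mathbf{H}\mathbf{f} = \boldsymbol{\phi}^*\mathbf{\Lambda}\boldsymbol{\phi} = \sum_{n=0}^{N-1} \lambda_n|\phi_n|^2 \tag{b.50}$$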
n=0 


Written in this form, it is easy to show that all conditions of a norm are met. 

With the norm $\sqrt{\mathbf{f}^*\mathbf{H}\mathbf{f}}$ the sets of points equidistant to the origin, i.e.
the vectors that satisfy $\mathbf{f}^*\mathbf{H}\mathbf{f} = \text{constant}$, are ellipsoids. See Figure B.2.
This follows from (b.50):

$$\mathbf{f}^*\mathbf{H}\mathbf{f} = \text{constant} \iff \sum_{n=0}^{N-1} \lambda_n|\phi_n|^2 = \text{constant}$$

Hence, if we introduce a new vector $\mathbf{u}$ with elements defined as:
$u_n = \phi_n\sqrt{\lambda_n}$, we must require that:

$$\sum_{n=0}^{N-1} |u_n|^2 = \text{constant}$$

We conclude that in the $\mathbf{u}$ domain the ordinary Euclidean norm applies.
Therefore, the solution space in this domain is a sphere (Figure B.2(a)). The
operation $u_n = \phi_n\sqrt{\lambda_n}$ is merely a scaling of the axes by factors $\sqrt{\lambda_n}$. This
transforms the sphere into an ellipsoid. The principal axes of this ellipsoid
line up with the basis vectors $\mathbf{u}$ (and $\boldsymbol{\phi}$), see Figure B.2(b). Finally, the
unitary transform $\mathbf{f} = \mathbf{V}\boldsymbol{\phi}$ rotates the principal axes, but without affecting
the shape of the ellipsoid (Figure B.2(c)). Hence, the orientation of these
axes in the $\mathbf{f}$ domain is along the directions of the eigenvectors $\mathbf{v}_n$ of $\mathbf{H}$.

The metric expressed by $\sqrt{(\mathbf{f} - \mathbf{g})^*\mathbf{H}(\mathbf{f} - \mathbf{g})}$ is called a quadratic
distance. In pattern classification theory it is usually called the Mahalanobis
distance.
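
A minimal Python sketch of the quadratic distance for real vectors is given below. It is an illustration only: the function name `quadratic_distance` and the example matrices are assumptions made here, and the function presumes that $\mathbf{H}$ is real, symmetric and positive definite.

```python
import math


def quadratic_distance(f, g, h):
    """sqrt((f - g)^T H (f - g)) for real vectors and a real symmetric H."""
    d = [a - b for a, b in zip(f, g)]
    hd = [sum(h[n][m] * d[m] for m in range(len(d))) for n in range(len(d))]
    value = sum(dn * hdn for dn, hdn in zip(d, hd))
    if value < 0:
        # Can only happen if H is not positive (semi-)definite.
        raise ValueError("H is not positive definite for this difference vector")
    return math.sqrt(value)


# Hypothetical example: eigenvalues 2 and 0.5 along the coordinate axes.
H = [[2.0, 0.0],
     [0.0, 0.5]]
identity = [[1.0, 0.0],
            [0.0, 1.0]]

d_axis0 = quadratic_distance([1.0, 0.0], [0.0, 0.0], H)   # sqrt(2)
d_axis1 = quadratic_distance([0.0, 1.0], [0.0, 0.0], H)   # sqrt(0.5)
d_euclid = quadratic_distance([3.0, 4.0], [0.0, 0.0], identity)
```

With the identity matrix the measure reduces to the ordinary Euclidean distance; with the diagonal example a unit step costs more along the axis with the larger eigenvalue, which is the stretching that turns circles into ellipses.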


Figure B.2 ‘Circles’ in $\mathbb{R}^2$ equipped with the Mahalanobis distance: (a) $\mathbf{u}$ domain, (b) $\boldsymbol{\phi}$ domain, (c) $\mathbf{f}$ domain
B.6 SINGULAR VALUE DECOMPOSITION (SVD)


The singular value decomposition theorem states that an arbitrary
$N \times M$ matrix $\mathbf{H}$ can be decomposed into the product:

$$\mathbf{H} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T \tag{b.51}$$

where $\mathbf{U}$ is an orthonormal $N \times R$ matrix, $\mathbf{\Sigma}$ is a diagonal $R \times R$ matrix,
and $\mathbf{V}$ is an orthonormal $M \times R$ matrix. $R$ is the rank of the matrix $\mathbf{H}$.
Therefore, $R \leq \min(N, M)$.

The proof of the theorem is beyond the scope of this book. However, its
meaning can be clarified as follows. Suppose for a moment, that
$R = N = M$. We consider all pairs of vectors of the form $\mathbf{y} = \mathbf{H}\mathbf{x}$ where
$\mathbf{x}$ is on the surface of the unit sphere in $\mathbb{R}^M$, i.e. $\|\mathbf{x}\| = 1$. The corresponding
$\mathbf{y}$ must be on the surface in $\mathbb{R}^N$ defined by the equation
$\mathbf{y}^T\mathbf{y} = \mathbf{x}^T\mathbf{H}^T\mathbf{H}\mathbf{x}$. The matrix $\mathbf{H}^T\mathbf{H}$ is a symmetric $M \times M$ matrix. By
virtue of (b.47) there must be an orthonormal matrix $\mathbf{V}$ and a diagonal
matrix $\mathbf{S}$ such that:¹

$$\mathbf{H}^T\mathbf{H} = \mathbf{V}\mathbf{S}\mathbf{V}^T \tag{b.52}$$


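The matrix $\mathbf{V} = [\mathbf{v}_0\ \cdots\ \mathbf{v}_{M-1}]$ contains the (unit) eigenvectors $\mathbf{v}_i$ of
$\mathbf{H}^T\mathbf{H}$. The corresponding eigenvalues are all on the diagonal of the
matrix $\mathbf{S}$. Without loss of generality we may assume that they are sorted
in descending order. Thus, $S_{i,i} \geq S_{i+1,i+1}$.

With $\boldsymbol{\xi} \overset{\text{def}}{=} \mathbf{V}^T\mathbf{x}$, the solutions of the equation $\mathbf{y}^T\mathbf{y} = \mathbf{x}^T\mathbf{H}^T\mathbf{H}\mathbf{x}$ with
$\|\mathbf{x}\| = 1$ form the same set as the solutions of $\mathbf{y}^T\mathbf{y} = \boldsymbol{\xi}^T\mathbf{S}\boldsymbol{\xi}$. Clearly, if $\mathbf{x}$ is
a unit vector, then so is $\boldsymbol{\xi}$ because $\mathbf{V}$ is orthonormal. Therefore, the
solutions of $\mathbf{y}^T\mathbf{y} = \boldsymbol{\xi}^T\mathbf{S}\boldsymbol{\xi}$ form the set:

$$\{\mathbf{y}\} \quad\text{with}\quad \mathbf{y} = \mathbf{\Sigma}\boldsymbol{\xi} \quad\text{and}\quad \|\boldsymbol{\xi}\| = 1$$

where:

$$\mathbf{\Sigma} = \mathbf{S}^{1/2} \tag{b.53}$$

$\mathbf{\Sigma}$ is the matrix that is obtained by taking the square roots of all
(diagonal) elements in $\mathbf{S}$. The diagonal elements of $\mathbf{\Sigma}$ are usually denoted
by the singular values $\sigma_i = \Sigma_{i,i} = +\sqrt{S_{i,i}}$. With that, the solutions
become a set:

$$\{\mathbf{y}\} \quad\text{with}\quad \mathbf{y} = \sum_{i=0}^{N-1} \sigma_i \xi_i \mathbf{v}_i \quad\text{and}\quad \sum_{i=0}^{N-1} \xi_i^2 = 1 \tag{b.54}$$

¹ We silently assume that $\mathbf{H}$ is a real matrix so that $\mathbf{V}$ is also real, and $\mathbf{V}^* = \mathbf{V}^T$.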


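Equation (b.54) reveals that the solutions of $\mathbf{y}^T\mathbf{y} = \mathbf{x}^T\mathbf{H}^T\mathbf{H}\mathbf{x}$ with
$\|\mathbf{x}\| = 1$ are formed by a scaled version of the sphere $\|\mathbf{x}\| = 1$. Scaling
takes place with a factor $\sigma_i$ in the direction of $\mathbf{v}_i$. In other words, the
solutions form an ellipsoid.

Each eigenvector $\mathbf{v}_i$ gives rise to a principal axis of the ellipsoid. The
direction of such an axis is the one of the vector $\mathbf{H}\mathbf{v}_i$. We form the matrix
$\mathbf{H}\mathbf{V} = [\mathbf{H}\mathbf{v}_0\ \cdots\ \mathbf{H}\mathbf{v}_{M-1}]$ which contains the vectors that point in the
directions of all principal axes. Using (b.52):

$$\mathbf{H}^T\mathbf{H}\mathbf{V} = \mathbf{V}\mathbf{S} \;\Rightarrow\; \mathbf{H}\mathbf{H}^T\mathbf{H}\mathbf{V} = \mathbf{H}\mathbf{V}\mathbf{S}$$

Consequently, the column vectors in the matrix $\mathbf{U}$ that fulfils

$$\mathbf{H}\mathbf{H}^T\mathbf{U} = \mathbf{U}\mathbf{S} \tag{b.55}$$

successively point in the directions of the principal axes. Since $\mathbf{S}$ is a diagonal
matrix, we conclude from (b.47) that $\mathbf{U}$ must contain the eigenvectors $\mathbf{u}_i$ of
the matrix $\mathbf{H}\mathbf{H}^T$, i.e. $\mathbf{U} = [\mathbf{u}_0\ \cdots\ \mathbf{u}_{N-1}]$. The diagonal elements of $\mathbf{S}$ are
the corresponding eigenvalues. $\mathbf{U}$ is an orthonormal matrix.

The operator $\mathbf{H}$ maps the vector $\mathbf{v}_i$ to a vector $\mathbf{H}\mathbf{v}_i$ whose direction is
given by the unit vector $\mathbf{u}_i$, and whose length is given by $\sigma_i$. Therefore,
$\mathbf{H}\mathbf{v}_i = \sigma_i\mathbf{u}_i$. Representing an arbitrary vector $\mathbf{x}$ as a linear combination of
$\mathbf{v}_i$, i.e. $\boldsymbol{\xi} = \mathbf{V}^T\mathbf{x}$, gives us finally the SVD theorem stated in (b.51).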

In the general case, $N$ can differ from $M$ and $R \leq \min(N, M)$. However,
(b.51), (b.52), (b.54) and (b.55) are still valid, except that the
dimensions of the matrices must be adapted. If $R < M$, then the singular
values from $\sigma_R$ up to $\sigma_{M-1}$ are zero. These singular values and the
associated eigenvectors are then discarded. The ellipsoid mentioned
above becomes an $R$-dimensional ellipsoid that is embedded in $\mathbb{R}^N$.

The process is depicted graphically in Figure B.3. The operator
$\boldsymbol{\xi} = \mathbf{V}^T\mathbf{x}$ aligns the operand to the suitable basis $\{\mathbf{v}_i\}$ by means of a
rotation. This operator also discards all components $\mathbf{v}_R$ up to $\mathbf{v}_{M-1}$ if
$R < M$. The operator $\boldsymbol{\mu} = \mathbf{\Sigma}\boldsymbol{\xi}$ stretches the axes formed by the $\{\mathbf{v}_i\}$ basis.
Finally, the operator $\mathbf{y} = \mathbf{U}\boldsymbol{\mu}$ maps each $\mathbf{v}_i$ component in $\mathbb{R}^M$ onto a $\mathbf{u}_i$
component in $\mathbb{R}^N$.

Figure B.3 Singular value decomposition of a matrix $\mathbf{H}$

An important application of the SVD is matrix approximation. For
that purpose, (b.51) is written in an equivalent form:

$$\mathbf{H} = \sum_{i=0}^{R-1} \sigma_i \mathbf{u}_i\mathbf{v}_i^T \tag{b.56}$$

We recall that the sequence $\sigma_0, \sigma_1, \cdots$ is in descending order. Also,
$\sigma_i > 0$. Therefore, if the matrix must be approximated by fewer than $R$
terms, then the best strategy is to nullify the last terms. For instance, if
the matrix $\mathbf{H}$ must be approximated by a vector product, i.e. $\mathbf{H} \approx \mathbf{f}\mathbf{g}^T$,
then the best approximation is obtained by nullifying all singular values
except $\sigma_0$. That is $\mathbf{H} \approx \sigma_0\mathbf{u}_0\mathbf{v}_0^T$.
The SVD theorem is useful to find the pseudo inverse of a matrix:

$$\mathbf{H}^+ = \mathbf{V}\mathbf{\Sigma}^{-1}\mathbf{U}^T = \sum_{i=0}^{R-1} \frac{1}{\sigma_i}\mathbf{v}_i\mathbf{u}_i^T \tag{b.57}$$

The pseudo inverse gives the least squares solution to the equation
$\mathbf{y} = \mathbf{H}\mathbf{x}$. In other words, the solution $\hat{\mathbf{x}} = \mathbf{H}^+\mathbf{y}$ is the one that minimizes
$\|\mathbf{H}\mathbf{x} - \mathbf{y}\|_2^2$. If the system is underdetermined, $R < M$, it provides a
minimum norm solution. That is, $\hat{\mathbf{x}}$ is the smallest solution of $\mathbf{y} = \mathbf{H}\mathbf{x}$.
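
A small worked example may help to see the minimum norm property. The $1 \times 2$ matrix below is a hypothetical choice made here for illustration ($N = 1$, $M = 2$, $R = 1$); the equation $y = x_0 + x_1$ is underdetermined.

```latex
\begin{align*}
\mathbf{H} &= \begin{bmatrix} 1 & 1 \end{bmatrix}, \qquad
\mathbf{H}^T\mathbf{H} = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}
  \;\Rightarrow\; S_{0,0} = 2, \quad
  \mathbf{v}_0 = \tfrac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad
  \sigma_0 = \sqrt{2}, \quad
  \mathbf{u}_0 = \frac{\mathbf{H}\mathbf{v}_0}{\sigma_0} = 1 \\
\mathbf{H}^+ &= \frac{1}{\sigma_0}\,\mathbf{v}_0\mathbf{u}_0^T
  = \begin{bmatrix} 1/2 \\ 1/2 \end{bmatrix}, \qquad
\hat{\mathbf{x}} = \mathbf{H}^+ y = \begin{bmatrix} y/2 \\ y/2 \end{bmatrix} \\
\text{all solutions:}\quad \mathbf{x}(t) &= \begin{bmatrix} y/2 + t \\ y/2 - t \end{bmatrix}, \qquad
\|\mathbf{x}(t)\|^2 = \tfrac{1}{2}y^2 + 2t^2 \;\geq\; \|\hat{\mathbf{x}}\|^2
\end{align*}
```

The norm is smallest for $t = 0$, i.e. for the solution delivered by the pseudo inverse.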

The SVD is also a good departure point for a stability analysis of the 
pseudo inverse, for instance by studying the ratio between the largest 
and smallest singular values. This ratio is a measure of the sensitivity of 
the inverse to noise in the data. If this ratio is too high, we may consider 
regularizing the pseudo inverse by discarding all terms in (b.57) whose 
singular values are too small. 

Besides matrix approximation and pseudo inverse calculation the SVD 
finds application in image restoration, image compression and image 
modelling. 
Appendix C


Probability Theory 


This appendix summarizes concepts from probability theory. This sum- 
mary only concerns those concepts that are part of the mathematical 
background required for understanding this book. Mathematical pecu- 
liarities which are not relevant here are omitted. At the end of the 
appendix references to detailed treatments are given. 


C.1 PROBABILITY THEORY AND RANDOM 
VARIABLES 


The axiomatic development of probability involves the definitions of 
three concepts. Taken together these concepts are called an experiment. 
The three concepts are: 


(a) A set $\Omega$ consisting of outcomes $\omega_i$. A trial is the act of randomly
    drawing a single outcome. Hence, each trial produces one $\omega \in \Omega$.
(b) $A$ is a set of certain¹ subsets of $\Omega$.
    Each subset $\alpha \in A$ is called an event. The event $\{\omega_i\}$, which
    consists of a single outcome, is called an elementary event. The
    set $\Omega$ is called the certain event. The empty subset $\emptyset$ is called the
    impossible event. We say that an event $\alpha$ occurred if the outcome
    $\omega$ of a trial is contained in $\alpha$, i.e. if $\omega \in \alpha$.
(c) A real function $P(\alpha)$ is defined on $A$. This function, called probability,
    satisfies the following axioms:

    I: $P(\alpha) \geq 0$
    II: $P(\Omega) = 1$
    III: If $\alpha, \beta \in A$ and $\alpha \cap \beta = \emptyset$ then $P(\alpha \cup \beta) = P(\alpha) + P(\beta)$

¹ This set of subsets must comply with some rules. In fact, $A$ must be a so-called Borel field
(a σ-algebra). But this topic is beyond the scope of this book.


Example 

The space of outcomes corresponding to the colours of a traffic-light
is: $\Omega = \{\text{red}, \text{green}, \text{yellow}\}$. The set $A$ may consist of subsets like:
$\emptyset$, red, green, yellow, red $\cup$ green, red $\cap$ green, red $\cup$ green $\cup$ yellow, ....
With that, $P(\text{green})$ is the probability that the light will be green.
$P(\text{green} \cup \text{yellow})$ is the probability that the light will be green or
yellow or both. $P(\text{green} \cap \text{yellow})$ is the probability that at the same
time the light is green and yellow.


A random variable $\underline{x}(\omega)$ is a mapping of $\Omega$ onto a set of numbers,
for instance: integer numbers, real numbers, complex numbers, etc.
The distribution function $F_{\underline{x}}(x)$ is the probability of the event that
corresponds to $\underline{x} \leq x$:

$$F_{\underline{x}}(x) = P(\underline{x} \leq x) \tag{c.1}$$


A note on the notation 

In the notation $F_{\underline{x}}(x)$ the variable $\underline{x}$ is the random variable of interest.
The variable $x$ is the independent variable. It can be replaced by other
independent variables or constants. So, $F_{\underline{x}}(s)$, $F_{\underline{x}}(x^2)$ and $F_{\underline{x}}(0)$ are all
valid notations. However, to avoid lengthy notations the abbreviation
$F(x)$ will often be used if it is clear from the context that $F_{\underline{x}}(x)$ is meant.
Also, the underscore notation for a random variable will be omitted
frequently.


The random variable $\underline{x}$ is said to be discrete if a finite number (or infinite
countable number) of events $x_1, x_2, \ldots$ exists for which:

$$P(\underline{x} = x_i) > 0 \quad\text{and}\quad \sum_{\text{all } i} P(\underline{x} = x_i) = 1 \tag{c.2}$$

Notation
If the context is clear, the notation $P(x_i)$ will be used to denote $P(\underline{x} = x_i)$.


The random variable $\underline{x}$ is continuous if a function $p_{\underline{x}}(x)$ exists for which:

$$F_{\underline{x}}(x) = \int_{\xi=-\infty}^{x} p_{\underline{x}}(\xi)\,d\xi \tag{c.3}$$

The function $p_{\underline{x}}(x)$ is called the probability density of $\underline{x}$. The discrete case
can be included in the continuous case by permitting $p_{\underline{x}}(x)$ to contain
Dirac functions of the type $P_i\,\delta(x - x_i)$.

Notation

If the context is clear, the notation $p(x)$ will be used to denote $p_{\underline{x}}(x)$.

Examples 
We consider the experiment consisting of tossing a (fair) coin. The
possible outcomes are {head, tail}. The random variable $\underline{x}$ is defined
according to:

    head → $\underline{x} = 0$
    tail → $\underline{x} = 1$

This random variable is discrete. Its distribution function and probabilities
are depicted in Figure C.1(a).

Figure C.1 (a) Discrete distribution function with probabilities. (b) Continuous
distribution function with a uniform probability density
An example of a continuous random variable is the round-off error
which occurs when a random real number is replaced with its nearest
integer number. This error is uniformly distributed between $-0.5$ and
$0.5$. The distribution function and associated probability density are
shown in Figure C.1(b).


Since $F(x)$ is a non-decreasing function of $x$, and $F(\infty) = 1$, the density
must be a non-negative function with $\int p(x)\,dx = 1$. Here, the integral
extends over the real numbers from $-\infty$ to $\infty$.


C.1.1 Moments 


The moment of order $n$ of a random variable is defined as:

$$E[x^n] = \int_{x=-\infty}^{\infty} x^n p(x)\,dx \tag{c.4}$$

Formally, the notation should be $E[\underline{x}^n]$, but as said before we omit
the underscore if there is no fear of ambiguity. Another notation of
$E[x^n]$ is $\overline{x^n}$.

The first order moment is called the expectation. This quantity is often
denoted by $\mu_x$ or (if confusion is not to be expected) briefly $\mu$. The
central moments of order $n$ are defined as:

$$E[(x - \mu)^n] = \int_{x=-\infty}^{\infty} (x - \mu)^n p(x)\,dx \tag{c.5}$$

The first central moment is always zero. The second central moment is
called variance:

$$\mathrm{Var}[x] = E[(x - \mu)^2] = E[x^2] - \mu^2 \tag{c.6}$$

The (standard) deviation of $x$ denoted by $\sigma_x$, or briefly $\sigma$, is the square
root of the variance:

$$\sigma_x = \sqrt{\mathrm{Var}[x]} \tag{c.7}$$
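
As a worked illustration of these definitions, the moments of the round-off error from the example above (uniform density $p(x) = 1$ on the interval from $-0.5$ to $0.5$) can be evaluated directly:

```latex
\begin{align*}
\mu = E[x] &= \int_{-1/2}^{1/2} x\,dx = 0 \\
\mathrm{Var}[x] = E[x^2] - \mu^2 &= \int_{-1/2}^{1/2} x^2\,dx
   = \left[\tfrac{1}{3}x^3\right]_{-1/2}^{1/2} = \tfrac{1}{12} \\
\sigma_x &= \sqrt{\tfrac{1}{12}} \approx 0.29
\end{align*}
```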
C.1.2 Poisson distribution


The act of counting the number of times that a certain random
event takes place during a given time interval often involves a Poisson
distribution. A discrete random variable $\underline{n}$ which obeys the Poisson
distribution has the probability function:

$$P(\underline{n} = n) = \frac{\lambda^n \exp(-\lambda)}{n!} \tag{c.8}$$

$\lambda$ is a parameter of the distribution. It is the expected number of events.
Thus, $E[\underline{n}] = \lambda$. A special feature of the Poisson distribution is that the
variance and the expectation are equal: $\mathrm{Var}[\underline{n}] = E[\underline{n}] = \lambda$. Examples of
this distribution are shown in Figure C.2.


Example 

Radiant energy is carried by a discrete number of photons. In
daylight situations, the average number of photons per unit area
is in the order of $10^{12}$ (1/(s·mm²)). However, in fact the real
number is Poisson distributed. Therefore, the relative deviation
$\sigma_n/E[\underline{n}]$ is $1/\sqrt{\lambda}$. An image sensor with an integrating area of
100 (μm²), an integration time of 25 (ms) and an illuminance of
250 (lx) receives about $\lambda = 10^6$ photons. Hence, the relative deviation
is about 0.1%. In most imaging sensors this deviation is almost
negligible compared to other noise sources.
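
The Poisson probability function and the relative deviation quoted in the example can be evaluated with a few lines of Python. This is an illustrative sketch: the function names are assumptions made here, the probability is computed in the log domain only to avoid overflow for large counts, and the sums are truncated at a point where the remaining tail is negligible.

```python
import math


def poisson_pmf(n, lam):
    """P(n) = lam**n * exp(-lam) / n!, evaluated via logarithms for stability."""
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))


def relative_deviation(lam):
    """sigma_n / E[n] = 1 / sqrt(lam) for a Poisson distributed count."""
    return 1.0 / math.sqrt(lam)


# Numerical check of E[n] = Var[n] = lambda for lambda = 4.
lam = 4.0
support = range(60)        # the probability mass beyond n = 59 is negligible here
mean = sum(n * poisson_pmf(n, lam) for n in support)
variance = sum((n - mean) ** 2 * poisson_pmf(n, lam) for n in support)

# Relative deviation for the photon count of the example (lambda = 10**6).
photon_relative_deviation = relative_deviation(1e6)   # 0.001, i.e. 0.1%
```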


C.1.3 Binomial distribution 


Suppose that we have an experiment with only two outcomes, $\omega_1$ and $\omega_2$.
The probability of the elementary event $\{\omega_1\}$ is denoted by $P$. Consequently,
the probability of $\{\omega_2\}$ is $1 - P$. We repeat the experiment
$N$ times, and form a random variable $\underline{n}$ by counting the number of times
that $\{\omega_1\}$ occurred. This random variable is said to have a binomial
distribution with parameters $N$ and $P$.

Figure C.2 Poisson distributions ($\lambda = 4$, $\lambda = 7$ and $\lambda = 10$)

The probability function of $\underline{n}$ is:

$$P(\underline{n} = n) = \frac{N!}{n!\,(N-n)!}\, P^n (1 - P)^{N-n} \tag{c.9}$$

The expectation of $\underline{n}$ appears to be $E[\underline{n}] = NP$ and its variance is
$\mathrm{Var}[\underline{n}] = NP(1 - P)$.


Example 

The error rate of a classifier is $E$. The classifier is applied to $N$ objects.
The number of misclassified objects, $\underline{n}_{error}$, has a binomial distribution
with parameters $N$ and $E$.
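
The classifier example can be made concrete with a short Python sketch. The numbers used (an error rate of 0.1 and 50 objects) are hypothetical and chosen here only for illustration, as is the function name `binomial_pmf`.

```python
import math


def binomial_pmf(n, trials, p):
    """P(n) = N! / (n! (N - n)!) * P**n * (1 - P)**(N - n), cf. (c.9)."""
    return math.comb(trials, n) * p ** n * (1 - p) ** (trials - n)


# Hypothetical numbers: error rate E = 0.1, classifier applied to N = 50 objects.
N, E = 50, 0.1
pmf = [binomial_pmf(n, N, E) for n in range(N + 1)]

mean = sum(n * q for n, q in enumerate(pmf))                    # expected: N * E
variance = sum((n - mean) ** 2 * q for n, q in enumerate(pmf))  # expected: N * E * (1 - E)
```

For these numbers about 5 misclassifications are expected, with a standard deviation of roughly 2.1, which gives a feeling for how much an error count from a test set of this size can fluctuate.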


C.1.4 Normal distribution 


A well-known example of a continuous random variable is the one with
a Gaussian (or normal) probability density:

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(\frac{-(x - \mu)^2}{2\sigma^2}\right) \tag{c.10}$$

The parameters $\mu$ and $\sigma^2$ are the expectation and the variance, respectively.
Two examples of the probability density with different $\mu$ and $\sigma^2$
are shown in Figure C.3.

Gaussian random variables occur whenever the underlying process is 
caused by the outcomes of many independent experiments and the 
associated random variables add up linearly (the central limit theorem). 
An example is thermal noise in an electrical current. The current is 
proportional to the sum of the velocities of the individual electrons. 
Another example is the Poisson distributed random variable mentioned
above. The envelope of the Poisson distribution approximates the Gaussian
distribution as $\lambda$ tends to infinity. As illustrated in Figure C.2, the
approximation looks quite reasonable already when $\lambda > 10$.

Also, the binomial distributed random variable is the result of an
addition of many independent outcomes. Therefore, the envelope of
the binomial distribution also approximates the normal distribution as
$NP(1 - P)$ tends to infinity. In practice, the approximation is already
reasonably good if $NP(1 - P) > 5$.

Figure C.3 Gaussian probability densities
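
The quality of the Gaussian approximation of the binomial distribution can be inspected numerically. The sketch below is illustrative: the parameters $N = 100$ and $P = 0.5$ (so that $NP(1 - P) = 25$) and the function names are assumptions made here.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Gaussian probability density, cf. (c.10)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def binomial_pmf(n, trials, p):
    """Binomial probability function, cf. (c.9)."""
    return math.comb(trials, n) * p ** n * (1 - p) ** (trials - n)


# Hypothetical parameters with N * P * (1 - P) = 25.
N, P = 100, 0.5
mu = N * P
sigma = math.sqrt(N * P * (1 - P))

# Largest absolute difference between the binomial probabilities and the
# Gaussian density sampled at the integers 0..N.
max_error = max(abs(binomial_pmf(n, N, P) - gaussian_pdf(n, mu, sigma))
                for n in range(N + 1))
peak = binomial_pmf(50, N, P)
```

For these parameters the peak probability is close to 0.08, while the largest deviation from the Gaussian envelope is orders of magnitude smaller.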


C.1.5 The Chi-square distribution 


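Another example of a continuous distribution is the $\chi^2_n$ distribution.
A random variable $\underline{y}$ is said to be $\chi^2_n$ distributed (Chi-square distributed
with $n$ degrees of freedom) if its density function equals:

$$p(y) = \begin{cases} \dfrac{1}{2^{n/2}\,\Gamma(n/2)}\; y^{\frac{n}{2}-1}\, e^{-\frac{y}{2}} & y \geq 0 \\[2ex] 0 & \text{elsewhere} \end{cases} \tag{c.11}$$

with $\Gamma()$ the so-called gamma function. The expectation and variance of
a Chi-square distributed random variable appear to be:

$$E[\underline{y}] = n \qquad \mathrm{Var}[\underline{y}] = 2n \tag{c.12}$$

Figure C.4 shows some examples of the density and distribution.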

The χ² distribution arises if we construct the sum of the square of
normally distributed random variables. Suppose that x_j with j = 1,...,n
are n Gaussian random variables with E[x_j] = 0 and Var[x_j] = 1.
In addition, assume that these random variables are mutually
independent,¹ then the random variable:


$$y = \sum_{j=1}^{n} x_j^2 \qquad \text{(c.13)}$$


is χ²_n distributed.


Figure C.4 Chi-square densities and (cumulative) distributions. The degrees of
freedom, n, varies from 1 up to 5
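
The moments in (c.12) can be verified by numerical integration of the
density (c.11). The following is a minimal sketch, assuming plain MATLAB
with the standard functions gamma and integral. The choice n = 3 and the
variable names are illustrative.

```matlab
% Numerical check of E[y] = n and Var[y] = 2n for the Chi-square density.
n = 3;                                        % degrees of freedom (illustrative)

% Density (c.11) for y >= 0
p = @(y) y.^(n/2 - 1).*exp(-y/2)/(2^(n/2)*gamma(n/2));

total = integral(p, 0, Inf);                         % should be close to 1
Ey    = integral(@(y) y.*p(y), 0, Inf);              % should be close to n
Vary  = integral(@(y) (y - Ey).^2.*p(y), 0, Inf);    % should be close to 2n

fprintf('area = %.4f, E[y] = %.4f, Var[y] = %.4f\n', total, Ey, Vary);
```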


C.2 BIVARIATE RANDOM VARIABLES 


In this section we consider an experiment in which two random variables
x and y are associated. The joint distribution function is the probability
that $\underline{x} \le x$ and $\underline{y} \le y$, i.e.

$$F(x, y) = P(\underline{x} \le x, \underline{y} \le y) \qquad \text{(c.14)}$$


The function p(x,y) for which:


$$F(x, y) = \int_{\xi=-\infty}^{x} \int_{\eta=-\infty}^{y} p(\xi, \eta)\, d\eta\, d\xi \qquad \text{(c.15)}$$


is called the joint probability density. Strictly speaking, definition (c.15)
holds true only when F(x,y) is continuous. However, by permitting
p(x,y) to contain Dirac functions, the definition also applies to the
discrete case.

¹ For the definition of ‘independent’: see Section C.2.

From definitions (c.14) and (c.15) it is clear that the marginal distri- 
bution F(x) and the marginal density p(x) are given by: 


$$F(x) = F(x, \infty) \qquad \text{(c.16)}$$


$$p(x) = \int_{y=-\infty}^{\infty} p(x, y)\, dy \qquad \text{(c.17)}$$


Two random variables x and y are independent if: 

$$F_{x,y}(x, y) = F_x(x)\, F_y(y) \qquad \text{(c.18)}$$

This is equivalent to:

$$p_{x,y}(x, y) = p_x(x)\, p_y(y) \qquad \text{(c.19)}$$


Suppose that h(·,·) is a function ℝ × ℝ → ℝ. Then h(x,y) is a random
variable. The expectation of h(x,y) equals:


$$E[h(x, y)] = \int_{x=-\infty}^{\infty} \int_{y=-\infty}^{\infty} h(x, y)\, p(x, y)\, dy\, dx \qquad \text{(c.20)}$$


The joint moments m_ij of two random variables x and y are defined as
the expectations of the functions x^i y^j:


$$m_{ij} = E[x^i y^j] \qquad \text{(c.21)}$$


The quantity i + j is called the order of m_ij. It can easily be verified
that: m_00 = 1, m_10 = E[x] = μ_x and m_01 = E[y] = μ_y.
The joint central moments μ_ij of order i + j are defined as:


$$\mu_{ij} = E\left[(x - \mu_x)^i (y - \mu_y)^j\right] \qquad \text{(c.22)}$$

Clearly, μ_20 = Var[x] and μ_02 = Var[y]. Furthermore, the parameter μ_11
is called the covariance (sometimes denoted by Cov[x,y]). This parameter
can be written as: μ_11 = m_11 − m_10 m_01. Two random variables are
called uncorrelated if their covariance is zero. Two independent random
variables are always uncorrelated. The reverse is not necessarily true.
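
That uncorrelated variables need not be independent is easily illustrated
with a small discrete example. In the sketch below (plain MATLAB; the
particular distribution and the variable names are illustrative
assumptions), x takes the values −1, 0 and 1 with equal probability and
y = x². The covariance vanishes, yet the joint probability does not
factorize as independence requires.

```matlab
% x is uniform on {-1, 0, 1}; y = x^2 is completely determined by x.
xv = [-1 0 1];              % values of x
px = [1 1 1]/3;             % probabilities of x
yv = xv.^2;                 % corresponding values of y

m10 = sum(px.*xv);          % E[x]
m01 = sum(px.*yv);          % E[y]
m11 = sum(px.*xv.*yv);      % E[xy]
mu11 = m11 - m10*m01;       % covariance

% Independence would require P(x=0, y=0) = P(x=0) P(y=0)
P_joint   = 1/3;            % x = 0 implies y = 0
P_product = (1/3)*(1/3);    % P(x=0) times P(y=0)

fprintf('covariance = %g\n', mu11);
fprintf('P(x=0,y=0) = %.4f, P(x=0)P(y=0) = %.4f\n', P_joint, P_product);
```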

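Two random variables x and y are Gaussian if their joint probability
density is:


$$p(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-r^2}} \exp\left( \frac{-1}{2(1-r^2)} \left( \frac{(x-\mu_x)^2}{\sigma_x^2} - \frac{2r(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y} + \frac{(y-\mu_y)^2}{\sigma_y^2} \right) \right) \qquad \text{(c.23)}$$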





The parameters μ_x and μ_y are the expectations of x and y, respectively.
The parameters σ_x and σ_y are the standard deviations. The parameter r is
called the correlation coefficient, defined as:


$$r = \frac{\mu_{11}}{\sqrt{\mu_{20}\mu_{02}}} = \frac{\mathrm{Cov}[x, y]}{\sigma_x \sigma_y} \qquad \text{(c.24)}$$


Figure C.5 shows a scatter diagram with 121 realizations of two Gaussian
random variables. In this figure, a geometrical interpretation of
(c.23) is also given. The set of points (x,y) which satisfy:


$$p(x, y) = p(\mu_x, \mu_y) \exp\left(-\tfrac{1}{2}\right)$$


(i.e. the 1σ-level contour) turns out to form an ellipse. The centre of
this ellipse coincides with the expectation. The eccentricity, size and
orientation of the 1σ contour describe the scattering of the samples
around this centre. √λ0 and √λ1 are the standard deviations of the
samples projected on the principal axes of the ellipse. The angle
between the principal axis associated with λ0 and the x axis is θ.
With these conventions, the variances of x and y, and the correlation
coefficient r, are:


$$\begin{aligned} \sigma_x^2 &= \lambda_0 \cos^2\theta + \lambda_1 \sin^2\theta \\ \sigma_y^2 &= \lambda_1 \cos^2\theta + \lambda_0 \sin^2\theta \\ r &= \frac{(\lambda_0 - \lambda_1)\sin\theta\cos\theta}{\sigma_x \sigma_y} \end{aligned} \qquad \text{(c.25)}$$


Figure C.5 Scatter diagram of two Gaussian random variables (the 1σ and
2σ contours are drawn around the centre at μ_x = m_10, μ_y = m_01)


Consequently, r = 0 whenever λ0 = λ1 (the ellipse degenerates into a
circle), or whenever θ is a multiple of π/2 (the ellipse lines up with the
axes). In both cases the random variables are independent. The conclusion
is that two Gaussian random variables which are uncorrelated are also
independent.

The situation r = 1 or r = −1 occurs only if λ0 = 0 or λ1 = 0. The
ellipse degenerates into a straight line. Therefore, if two Gaussian
random variables are completely correlated, then this implies that two
constants a and b can be found for which y = ax + b.
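
The relations in (c.25) can be cross-checked by building, from the
principal axes of the ellipse, the 2 × 2 matrix that holds the variances
on its diagonal and the covariance off the diagonal. The following is a
minimal sketch in plain MATLAB. The values of λ0, λ1 and θ, and all
variable names, are arbitrary illustrative choices.

```matlab
% Check (c.25): variances and correlation coefficient from the
% eigenvalues lambda0, lambda1 and the orientation theta of the ellipse.
lambda0 = 4;  lambda1 = 1;  theta = pi/6;     % illustrative values

% Rotation matrix whose columns are the principal axes
V = [cos(theta) -sin(theta); sin(theta) cos(theta)];
C = V*diag([lambda0 lambda1])*V';             % variances and covariance of x, y

% Right-hand sides of (c.25)
var_x = lambda0*cos(theta)^2 + lambda1*sin(theta)^2;
var_y = lambda1*cos(theta)^2 + lambda0*sin(theta)^2;
r     = (lambda0 - lambda1)*sin(theta)*cos(theta)/sqrt(var_x*var_y);

% Same quantities read from the matrix C
r_from_C = C(1,2)/sqrt(C(1,1)*C(2,2));
fprintf('var_x = %.4f (%.4f), var_y = %.4f (%.4f), r = %.4f (%.4f)\n', ...
        var_x, C(1,1), var_y, C(2,2), r, r_from_C);
```

Setting lambda0 equal to lambda1, or theta to a multiple of pi/2, should
give r = 0, as stated above.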

The expectation and variance of the random variable defined by
z = ax + by are:

$$E[z] = aE[x] + bE[y] \qquad \text{(c.26)}$$


$$\sigma_z^2 = a^2\sigma_x^2 + b^2\sigma_y^2 + 2ab\,\mathrm{Cov}[x, y] \qquad \text{(c.27)}$$


The expectation appears to be a linear operation, regardless of the joint
distribution function. The variance is not a linear operation. However,
(c.27) shows that if two random variables are uncorrelated, then
σ²_{x+y} = σ_x² + σ_y².
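
Equation (c.27) follows directly from the definition of the variance and
the central moments of (c.22). As a short worked derivation, writing out
the square gives:

```latex
\begin{aligned}
\sigma_z^2 &= E\big[(z - E[z])^2\big]
            = E\big[\big(a(x-\mu_x) + b(y-\mu_y)\big)^2\big] \\
           &= a^2 E\big[(x-\mu_x)^2\big] + b^2 E\big[(y-\mu_y)^2\big]
              + 2ab\, E\big[(x-\mu_x)(y-\mu_y)\big] \\
           &= a^2 \sigma_x^2 + b^2 \sigma_y^2 + 2ab\, \mathrm{Cov}[x,y]
\end{aligned}
```

The first step uses the linearity of the expectation, (c.26). With
a = b = 1 and Cov[x,y] = 0 the result reduces to the sum rule for
uncorrelated variables.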

Another important aspect of two random variables is the concept of
conditional probabilities and moments. The conditional distribution
F_{x|y}(x|y) (or in shorthand notation F(x|y)) is the probability that
$\underline{x} \le x$ given that $\underline{y} \le y$. Written symbolically:

$$F_{x|y}(x|y) = F(x|y) = P(\underline{x} \le x \mid \underline{y} \le y) \qquad \text{(c.28)}$$


The conditional probability density associated with F_{x|y}(x|y) is denoted
by p_{x|y}(x|y) or p(x|y). Its definition is similar to (c.3). An important
property of conditional probability densities is Bayes’ theorem for
conditional probabilities:


$$p_{x|y}(x|y)\, p_y(y) = p_{x,y}(x, y) = p_{y|x}(y|x)\, p_x(x) \qquad \text{(c.29)}$$

or in shorthand notation:


$$p(x|y)\, p(y) = p(x, y) = p(y|x)\, p(x)$$
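
For discrete random variables, (c.29) can be checked on a small table of
joint probabilities. The sketch below uses plain MATLAB. The 2 × 3 table P
and the variable names are illustrative assumptions. Rows index the values
of x, columns the values of y.

```matlab
% Joint probabilities P(x,y): rows = values of x, columns = values of y.
P = [0.10 0.20 0.10;
     0.30 0.05 0.25];                 % illustrative table, sums to one

Px = sum(P, 2);                       % marginal of x (column vector)
Py = sum(P, 1);                       % marginal of y (row vector)

% Conditional probabilities, obtained by element-wise division
Px_given_y = P./repmat(Py, 2, 1);     % P(x|y): columns sum to one
Py_given_x = P./repmat(Px, 1, 3);     % P(y|x): rows sum to one

% Both sides of Bayes' theorem reproduce the joint table
left  = Px_given_y.*repmat(Py, 2, 1);
right = Py_given_x.*repmat(Px, 1, 3);
disp(max(abs(left(:) - right(:))));   % zero up to rounding errors
```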


Bayes’ theorem is very important for the development of a classifier or
an estimator; see Chapter 2 and 3.
The conditional moments are defined as:


$$E[\underline{x}^n \mid \underline{y} = y] = \int_{x=-\infty}^{\infty} x^n\, p(x|y)\, dx \qquad \text{(c.30)}$$


The shorthand notation is E[x^n|y]. The conditional expectation
and conditional variance are sometimes denoted by μ_{x|y} and σ²_{x|y},
respectively.


C.3 RANDOM VECTORS


In this section, we discuss a finite sequence of N random variables:
x_0, x_1, ..., x_{N−1}. We assume that these variables are arranged in an
N-dimensional random vector x. The joint distribution function F(x)
is defined as:


$$F(x) = P(\underline{x}_0 \le x_0, \underline{x}_1 \le x_1, \ldots, \underline{x}_{N-1} \le x_{N-1}) \qquad \text{(c.31)}$$


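The probability density p(x) of the vector x is the function that satisfies:

$$F(x) = \int_{\xi=-\infty}^{x} p(\xi)\, d\xi \qquad \text{(c.32)}$$


with:


$$\int_{\xi=-\infty}^{x} p(\xi)\, d\xi = \int_{\xi_0=-\infty}^{x_0} \int_{\xi_1=-\infty}^{x_1} \cdots \int_{\xi_{N-1}=-\infty}^{x_{N-1}} p(\xi)\, d\xi_{N-1} \cdots d\xi_1\, d\xi_0$$


The expectation of a function g(x): ℝ^N → ℝ is:


$$E[g(x)] = \int_{x=-\infty}^{\infty} g(x)\, p(x)\, dx \qquad \text{(c.33)}$$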


Similar definitions apply to vector-to-vector mappings (ℝ^N → ℝ^N) and
vector-to-matrix mappings (ℝ^N → ℝ^N × ℝ^N). Particularly, the
expectation vector μ_x = E[x] and the covariance matrix
C_x = E[(x − μ_x)(x − μ_x)^T] are frequently used.

A Gaussian random vector has a probability density given by:


$$p(x) = \frac{1}{\sqrt{(2\pi)^N |C_x|}} \exp\left( -\frac{(x - \mu_x)^T C_x^{-1} (x - \mu_x)}{2} \right) \qquad \text{(c.34)}$$


The parameters μ_x (expectation vector) and C_x (covariance matrix) fully
define the probability density.

A random vector is called uncorrelated if its covariance matrix is a
diagonal matrix. If the elements of a random vector x are independent,
the probability density of x is the product of the probability densities of
the elements:


$$p(x) = \prod_{n=0}^{N-1} p(x_n) \qquad \text{(c.35)}$$


Such a random vector is uncorrelated. The reverse holds true in some
specific cases, e.g. for all Gaussian random vectors.

The conditional probability density p(x|y) of two random vectors x 
and y is the probability density of x if y is known. Bayes’ theorem for 
conditional probability densities becomes: 


$$p(x|y)\, p(y) = p(x, y) = p(y|x)\, p(x) \qquad \text{(c.36)}$$


The definitions of the conditional expectation vector and the conditional 
covariance matrix are similar. 


C.3.1 Linear operations on Gaussian random vectors 


Suppose the random vector y results from a linear (matrix) operation
y = Ax. The input vector of this operator x has expectation vector μ_x
and covariance matrix C_x, respectively. Then, the expectation of the
output vector and its covariance matrix are:


$$\mu_y = A\mu_x, \qquad C_y = A C_x A^T \qquad \text{(c.37)}$$

These relations hold true regardless of the type of distribution functions. 
In general, the distribution function of y may be of a type different 
from the one of x. For instance, if the elements from x are uniformly 
distributed, then the elements of y will not be uniform except for trivial 
cases (e.g. when A = I). However, if x has a Gaussian distribution, then 
so has y. 


Example 

Figure C.6 shows the scatter diagram of a Gaussian random vector x,
the covariance matrix of which is the identity matrix I. Such a vector
is called white. Multiplication of x by a diagonal matrix Λ^(1/2) yields a
vector y with covariance matrix Λ. This vector is still uncorrelated.
Application of a unitary matrix V to y yields the random vector z,
which is correlated.

Figure C.6 Scatter diagrams of Gaussian random vectors applied to two linear
operators
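
The chain of operations in this example can be followed numerically with
(c.37). The following is a minimal sketch in plain MATLAB. The diagonal
matrix (Lambda), the unitary matrix (V, here a rotation over 30 degrees)
and all variable names are illustrative assumptions.

```matlab
% Covariance propagation (c.37) for the example: white x -> y -> z.
Cx     = eye(2);                       % white vector: identity covariance
Lambda = diag([4 1]);                  % illustrative diagonal matrix
phi    = pi/6;                         % illustrative rotation angle
V      = [cos(phi) -sin(phi); sin(phi) cos(phi)];   % unitary matrix

A1 = sqrt(Lambda);                     % Lambda^(1/2); element-wise sqrt is
                                       % valid because Lambda is diagonal
Cy = A1*Cx*A1';                        % equals Lambda: still uncorrelated
Cz = V*Cy*V';                          % equals V*Lambda*V': correlated

disp(Cy);
disp(Cz);
```

The off-diagonal elements of Cy are zero, whereas those of Cz are not,
which is the numerical counterpart of the tilted scatter diagram.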


C.3.2 Decorrelation 


Suppose a random vector z with covariance matrix C_z is given.
Decorrelation is a linear operation A which, when applied to z, will give
a white random vector x. (The random vector x is called white if its
covariance matrix C_x = I.) The operation A can be found by
diagonalization of the matrix C_z. To see this, it suffices to recognize
that the matrix C_z is self-adjoint, i.e. C_z = C_z^T. According to
Section B.5 a unitary matrix V and a (real) diagonal matrix Λ must exist
such that C_z = VΛV^T. The matrix V consists of the normalized
eigenvectors v_n of C_z, i.e. V = [v_0 ··· v_{N−1}]. The matrix Λ contains
the eigenvalues λ_n at the diagonal. Therefore, application of the unitary
transform V^T yields a random vector y = V^T z the covariance matrix of
which is Λ. Furthermore, the operation Λ^(−1/2) applied to y gives the
white vector x = Λ^(−1/2) y. Hence, the decorrelation/whitening operation
A equals Λ^(−1/2) V^T. Note that the operation Λ^(−1/2) V^T is the inverse
of the operations shown in Figure C.6.

We define the 1σ-level contour of a Gaussian distribution as the
solution of:


$$(z - \mu_z)^T C_z^{-1} (z - \mu_z) = 1 \qquad \text{(c.38)}$$


The level contour is an ellipse (N = 2), an ellipsoid (N = 3) or a
hyperellipsoid (N > 3). A corollary of the above is that the principal
axes of these ellipsoids point in the direction of the eigenvectors of C_z,
and that the ellipsoids intersect these axes at a distance from μ_z equal to
the square root of the associated eigenvalue. See Figure C.6.
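
The construction of the whitening operation can be sketched in a few
lines of plain MATLAB. The covariance matrix Cz below and the variable
names are illustrative assumptions. The standard function eig returns the
eigenvectors as the columns of V and the eigenvalues on the diagonal of
Lambda.

```matlab
% Whitening: find A = Lambda^(-1/2) * V' such that A*Cz*A' = I.
Cz = [3.25 1.30; 1.30 1.75];           % illustrative covariance matrix
                                       % (symmetric, positive definite)

[V, Lambda] = eig(Cz);                 % Cz = V*Lambda*V' for symmetric Cz
A  = diag(1./sqrt(diag(Lambda)))*V';   % decorrelation/whitening operation
Cx = A*Cz*A';                          % covariance after whitening, cf. (c.37)

disp(Cx);                              % identity matrix up to rounding errors

% The 1-sigma contour (c.38) crosses each principal axis at a distance
% equal to the square root of the associated eigenvalue.
semi_axes = sqrt(diag(Lambda))';
disp(semi_axes);
```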


C.4 REFERENCE 


Papoulis, A., Probability, Random Variables and Stochastic Processes, McGraw-Hill, 
New York, 1965 (third edition: 1991). 


Appendix D 


Discrete-time Dynamic Systems


This brief review is meant as a refresher for readers who are familiar 
with the topic. It summarizes those concepts that are used within the 
textbook. It also introduces the notations adopted in this book. 


D.1 DISCRETE-TIME DYNAMIC SYSTEMS 


A dynamic system is a system whose variables proceed in time. A concise 
representation of such a system is the state space model. The model 
consists of a state vector x(i) where i is an integer variable representing 
the discrete time. The dimension of x(i), called the order of the system, 
is M. We assume that the state vector is real-valued. The finite-state case 
is introduced in Chapter 4. 

The process can be influenced by a control vector (input vector) u(i) 
with dimension L. The output of the system is given by the measurement 
vector (observation vector) z(i) with dimension N. The output is mod- 
elled as a memoryless vector that depends on the current values of the 
state vector and the control vector. 

By definition, the state vector holds the minimum number of variables
which completely summarize the past of the system. Therefore, the state
vector at time i + 1 is derived from the state vector and the control
vector, both valid at time i:


$$\begin{aligned} x(i+1) &= f(x(i), u(i), i) \\ z(i) &= h(x(i), u(i), i) \end{aligned} \qquad \text{(d.1)}$$

f(.) is a possibly nonlinear vector function, the system function, that may
depend explicitly on time. h(.) is the measurement function.


Classification, Parameter Estimation and State Estimation: An Engineering Approach using MATLAB
F. van der Heijden, R.P.W. Duin, D. de Ridder and D.M.J. Tax
© 2004 John Wiley & Sons, Ltd ISBN: 0-470-09013-8

Note that if the time series starts at i = i_0, and the initial condition
x(i_0) is given, along with the input sequence u(i_0), u(i_0 + 1), ...,
then (d.1) can be used to solve x(i) for every i > i_0. Such a solution is
called a trajectory.


D.2 LINEAR SYSTEMS 


If the system is linear, then the equations evolve into matrix-vector 
equations according to: 


$$\begin{aligned} x(i+1) &= F(i)x(i) + L(i)u(i) \\ z(i) &= H(i)x(i) + D(i)u(i) \end{aligned} \qquad \text{(d.2)}$$

The matrices F(i), L(i), H(i) and D(i) must have the appropriate
dimensions. They have the following names:


F(i)   system matrix
L(i)   distribution matrix (gain matrix)
H(i)   measurement matrix
D(i)   feedforward gain


Often, the feedforward gain is zero. 
The solution of the linear system is found by introduction of the
transition matrix Φ(i, i_0), recursively defined by:


$$\begin{aligned} \Phi(i_0, i_0) &= I \\ \Phi(i+1, i_0) &= F(i)\,\Phi(i, i_0) \quad \text{for } i = i_0, i_0+1, \ldots \end{aligned} \qquad \text{(d.3)}$$


Given the initial condition x(i_0) at i = i_0, and assuming that the
feedforward gain is zero, the solution is found as:


$$x(i) = \Phi(i, i_0)\, x(i_0) + \sum_{j=i_0}^{i-1} \Phi(i, j+1)\, L(j)\, u(j) \qquad \text{(d.4)}$$


D.3 LINEAR TIME INVARIANT SYSTEMS


In the linear time invariant case, the matrices become constants: 


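$$\begin{aligned} x(i+1) &= Fx(i) + Lu(i) \\ z(i) &= Hx(i) + Du(i) \end{aligned} \qquad \text{(d.5)}$$

and the transition matrix simplifies to:

$$\Phi(i, i_0) = \Phi(i - i_0) = F^{i-i_0} \qquad \text{(d.6)}$$

With that, the input/output relation of the system is:

$$z(i) = HF^{i-i_0}x(i_0) + \sum_{j=i_0}^{i-1} HF^{i-j-1}Lu(j) + Du(i) \qquad \text{(d.7)}$$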


The first term at the r.h.s. is the free response; the second term is the 
particular response; the third term is the feedforward response. 
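
The equivalence between the recursion (d.5) and the closed-form
input/output relation (d.7) can be checked numerically. The sketch below
is plain MATLAB with i_0 = 0. All matrices, the initial state, the input
sequence and the variable names are illustrative assumptions.

```matlab
% Simulate a second order LTI system recursively and compare the last
% measurement with the closed-form expression (d.7), taking i0 = 0.
F = [0.9 0.2; -0.1 0.7];   L = [0; 1];      % system and distribution matrix
H = [1 0];                 D = 0.5;         % measurement matrix, feedforward gain
x0 = [1; -1];                               % initial condition x(0)
u  = [1 0 -1 2 0.5 1];                      % u(0), ..., u(5)
i_end = 5;                                  % time at which z is evaluated

% Recursion (d.5)
x = x0;
for i = 0:i_end-1
    x = F*x + L*u(i+1);                     % MATLAB indices start at 1
end
z_rec = H*x + D*u(i_end+1);

% Closed form (d.7): free, particular and feedforward response
z_free = H*F^i_end*x0;
z_part = 0;
for j = 0:i_end-1
    z_part = z_part + H*F^(i_end-j-1)*L*u(j+1);
end
z_closed = z_free + z_part + D*u(i_end+1);

fprintf('recursive: %.6f, closed form: %.6f\n', z_rec, z_closed);
```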


D.3.1 Diagonalization of a system 


We assume that the system matrix F has M distinct eigenvectors v_k with
corresponding (distinct) eigenvalues λ_k. Thus, by definition
Fv_k = λ_k v_k. We define the eigenvalue matrix Λ as the M × M diagonal
matrix containing the eigenvalues at the diagonal, i.e. Λ_{k,k} = λ_k, and
the eigenvector matrix V as the matrix containing the eigenvectors as its
column vectors, i.e. V = [v_0 ··· v_{M−1}]. Consequently:


$$FV = V\Lambda \quad \Rightarrow \quad F = V\Lambda V^{-1} \qquad \text{(d.8)}$$

A corollary of (d.8) is that the power of F is calculated as follows:

$$F^k = (V\Lambda V^{-1})^k = V\Lambda^k V^{-1} \qquad \text{(d.9)}$$
Equation (d.9) is useful to decouple the system into a number of (scalar)
first order systems, i.e. to diagonalize the system. For that purpose,
define the vector


$$y(i) = V^{-1}x(i) \qquad \text{(d.10)}$$

Then, the state equation transforms into:


$$\begin{aligned} y(i+1) &= V^{-1}x(i+1) \\ &= V^{-1}Fx(i) + V^{-1}Lu(i) \\ &= V^{-1}FVy(i) + V^{-1}Lu(i) \\ &= \Lambda y(i) + V^{-1}Lu(i) \end{aligned} \qquad \text{(d.11)}$$

and the solution is found as:

$$y(i) = \Lambda^{i-i_0}y(i_0) + \sum_{j=i_0}^{i-1} \Lambda^{i-j-1}V^{-1}Lu(j) \qquad \text{(d.12)}$$

The k-th component of the vector y(i) is:

$$y_k(i) = \lambda_k^{i-i_0}y_k(i_0) + \sum_{j=i_0}^{i-1} \lambda_k^{i-j-1}\left(V^{-1}Lu(j)\right)_k \qquad \text{(d.13)}$$


The measurement vector is obtained from y(i) by substitution of
x(i) = Vy(i) in equation (d.5).
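
Equations (d.8) and (d.9) are easily verified numerically. The following
is a minimal sketch in plain MATLAB. The system matrix F, which has two
distinct real eigenvalues, the power k and the variable names are
illustrative assumptions.

```matlab
% Diagonalization of a system matrix and the power rule (d.9).
F = [0.5 0.3; 0.2 0.4];             % illustrative system matrix
k = 6;                              % illustrative power

[V, Lambda] = eig(F);               % F*V = V*Lambda, see (d.8)
Fk_direct = F^k;                    % repeated matrix multiplication
Fk_modal  = V*Lambda^k/V;           % V*Lambda^k*inv(V), see (d.9)

disp(diag(Lambda)');                % the eigenvalues of F
disp(max(abs(Fk_direct(:) - Fk_modal(:))));   % zero up to rounding errors
```

Because Lambda is diagonal, Lambda^k only requires raising each
eigenvalue to the power k, which is what decouples the system into
scalar first order systems.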

If the system matrix does not have distinct eigenvalues, the situation is
a little more involved. In that case, the matrix Λ gets the Jordan form
with eigenvalues at the diagonal, but with ones at the superdiagonal. The
free response becomes combinations of terms p_k(i − i_0) λ_k^(i−i_0) where
p_k(i) are polynomials in i with order one less than the multiplicity of
λ_k.


D.3.2 Stability 


In dynamic systems there are various definitions of the term stability. We 
return to the general case (d.1) first, and then check to see how the 
definition applies to the linear time invariant case. Let x_a(i) and
x_b(i) be the solutions of (d.1) with a given input sequence and with
initial conditions x_a(i_0) and x_b(i_0), respectively.

The solution x_a(i) is stable if for every ε > 0 we can find a δ > 0 such
that every x_b(i_0) with ||x_b(i_0) − x_a(i_0)|| < δ gives
||x_b(i) − x_a(i)|| < ε for all i ≥ i_0. Loosely speaking, the system is
stable if small changes in the initial condition of a stable solution
cannot lead to very large changes of the trajectory.

The solution x_a(i) is asymptotically stable if it is stable, and if
||x_b(i) − x_a(i)|| → 0 as i → ∞ provided that ||x_b(i_0) − x_a(i_0)|| is
not too large. Loosely speaking, the additional requirement for a system
to be asymptotically stable is that the initial condition does not
influence the solution in the long range.

For linear systems, the stability of one solution assures the stability of
all other solutions. Thus, stability is then a property of the system, not
just of one of its solutions. From (d.13) it can be seen that a linear time
invariant system is asymptotically stable if and only if the magnitudes of
all eigenvalues are less than one. That is, the eigenvalues must all be
within the (complex) unit circle.

A linear time invariant system has BIBO stability (bounded input,
bounded output) if a bounded input sequence always gives rise to a
bounded output sequence. Note that asymptotic stability implies BIBO
stability, but the reverse is not true. A system can be BIBO stable while
it is not asymptotically stable.
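
In practice, the eigenvalue criterion gives a one-line test for asymptotic
stability. The following is a minimal sketch in plain MATLAB. Both system
matrices and the helper name is_stable are illustrative assumptions.

```matlab
% Asymptotic stability of x(i+1) = F*x(i): all eigenvalues must lie
% strictly inside the unit circle.
F_stable   = [0.5 0.3; 0.2 0.4];    % eigenvalues 0.7 and 0.2
F_unstable = [1.1 0.0; 0.3 0.5];    % eigenvalues 1.1 and 0.5

is_stable = @(F) max(abs(eig(F))) < 1;

fprintf('F_stable:   %d\n', is_stable(F_stable));
fprintf('F_unstable: %d\n', is_stable(F_unstable));

% Free response of the stable system: the state decays towards zero
x = [1; 1];
for i = 1:50
    x = F_stable*x;
end
disp(norm(x));                      % small: the initial condition is forgotten
```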


D.4 REFERENCES 


Åström, K.J. and Wittenmark, B., Computer-Controlled Systems - Theory and Design, 
Second Edition, Prentice Hall International, Englewood Cliffs, NJ, 1990. 

Luenberger, D.G., Introduction to Dynamic Systems, Theory, Models and Applications, 
Wiley, Toronto, 1979. 


Appendix E 


Introduction to PRTools 


E.1 MOTIVATION 


In statistical pattern recognition we study techniques for the general- 
ization of decision rules to be used for the recognition of patterns in 
experimental data sets. This area of research has a strong computational 
character, demanding a flexible use of numerical programs for data 
analysis as well as for the evaluation of the procedures. As new methods 
keep being proposed in the literature, a programming platform is needed 
that enables a fast and flexible implementation of such algorithms. 
Because of its widespread availability, its simple syntax and general 
nature, MATLAB is a good choice for such a platform. 

The pattern recognition routines and support functions offered by 
PRTools represent a basic set covering largely the area of statistical 
pattern recognition. Many methods and proposals, however, are not yet 
implemented. Neural networks are only implemented partially, as 
MATLAB already includes a very good toolbox in that area. PRTools has 
a few limitations. Due to the heavy memory demands of MATLAB, very 
large problems with learning sets of tens of thousands of objects cannot be 
handled on moderate machines. Moreover, some algorithms are slow as it 
can be difficult to avoid nested loops. In the present version, the handling 
of missing data has been prepared, but no routines are implemented yet.

Table E.1 Notation differences between this book and the PRTools
documentation

Mathematical   Notation in    Notation     Meaning
notation       pseudo-code    in PRTools

T              x, z           a, b         data set
n              n              m            number of objects
N, D           N, D           k, n         number of features, dimensions
K              K              C            number of classes


The use of fuzzy or symbolic data is not supported, except for soft (and 
thereby also fuzzy) labels which are used by just a few routines. Multi- 
dimensional target fields are allowed, but at this moment no procedure 
makes use of this possibility. 

The notation used in the PRTools documentation and code differs 
slightly from that used in the code throughout this book. In this appendix 
we try to follow the notation in the book. In Table E.1 notation differ- 
ences between this book and the PRTools documentation are given. 


E.2 ESSENTIAL CONCEPTS IN PRTOOLS 


For recognizing the classes of objects they are first scanned by sensors, then 
represented, e.g. in a feature space, and after some possible feature reduc- 
tion steps they are finally mapped by a classifier to the set of class labels. 
Between the initial representation in the feature space and this final map- 
ping to the set of class labels the representation may be changed several 
times: simplified feature spaces (feature selection), normalization of features 
(e.g. by scaling), linear or nonlinear mappings (feature extraction), classifi- 
cation by a possible set of classifiers, combining classifiers and the final 
labelling. In each of these steps the data is transformed by some mapping. 

Based on this observation, PRTools defines the following two basic
concepts:


• Data sets: matrices in which the rows represent the objects and the
columns the features, class memberships or other fixed sets of
properties (e.g. distances to a fixed set of other objects).

• Mappings: transformations operating on data sets.


As pattern recognition has two stages, training and execution, mappings
also have two main types:

• An untrained mapping refers to just the concept of a method, e.g.
forward feature selection, PCA or Fisher’s linear discriminant. It may
have some parameters that are needed for training, e.g. the desired
number of features or some regularization parameters. If an untrained
mapping is applied to a data set it will be trained by it (training).

• A trained mapping is specific for its training set. This data set
thereby determines the input dimensionality (e.g. the number of
input features) as well as the output dimensionality (e.g. the number
of output features or the number of classes). If a trained mapping
is applied to a data set, it will transform the data set according
to its definition (execution).


In addition, fixed mappings are used. They are almost identical to 
trained mappings, except that they do not result from a training step, but 
are directly defined by the user: e.g. the transformation of distances by a 
sigmoid function to the [0, 1] interval. 

PRTools deals with sets of labelled or unlabelled objects and offers 
routines for the generalization of such sets into functions for mapping 
and classification. A classifier is a special case of a mapping, as it maps 
objects on class labels or on [0,1] intervals that may be interpreted as 
class memberships, soft labels or posterior probabilities. An object is an
N-dimensional vector of feature values, distances, (dis)similarities or
class memberships. Within PRTools they are usually just called features. 
It is assumed that for all objects in a problem all values of the same set of 
features are given. The space defined by the actual set of features is 
called the feature space. Objects are represented as points or vectors in 
this space. New objects in a feature space are usually gradually con- 
verted to labels by a series of mappings followed by a final classifier. 
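
The distinction between untrained, trained and fixed mappings can be
mimicked in a few lines of plain MATLAB. The sketch below is not PRTools
code and uses none of its routines. The names train_scaler, apply_scaler
and sigmoid are hypothetical and serve only to illustrate the concept on
a feature scaling step.

```matlab
% Data set: rows are objects, columns are features.
z = [1 10; 2 20; 3 30; 4 40];                 % training set (4 objects, 2 features)
y = [2.5 25; 5 50];                           % test set (2 objects, 2 features)

% 'Untrained mapping': only the method. Training on a data set returns
% the parameters (here: mean and standard deviation of each feature).
train_scaler = @(a) struct('mu', mean(a, 1), 'sigma', std(a, 0, 1));

% 'Trained mapping': parameters fixed by the training set; execution
% transforms any data set with the same number of features.
apply_scaler = @(a, w) (a - repmat(w.mu, size(a, 1), 1)) ./ ...
                       repmat(w.sigma, size(a, 1), 1);

% 'Fixed mapping': defined directly by the user, no training involved.
sigmoid = @(a) 1./(1 + exp(-a));

w  = train_scaler(z);                         % training
yy = apply_scaler(y, w);                      % execution on the test set
disp(sigmoid(yy));                            % values in the [0, 1] interval
```

In PRTools itself both stages are written with the overloaded * operator,
as described in the sections that follow.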


E.3 IMPLEMENTATION 


PRTools uses the object-oriented features of the MATLAB programming 
language. Two object classes (not to be confused with the objects and 
classes in pattern recognition) have been defined: dataset and 
mapping. A large number of operators (like *, [] etc.) and MATLAB 
commands have been overloaded and have a special meaning when 
applied to a dataset and/or a mapping. 

The central data structure of PRTools is the dataset. It primarily 
consists of a set of objects represented by a matrix of feature vectors. 
Attached to this matrix is a set of labels, one for each object and a set of
feature names, also called feature labels. Labels can be integer numbers
or character strings. Moreover, a set of prior probabilities, one for each 
class, is stored. In most help files of PRTools, a dataset is denoted by a. 
We will use z in this text to stay consistent with the rest of the book. In 
almost any routine this is one of the inputs. Almost all routines can 
handle multi-class object sets. It is possible that for some objects no 
label is specified (a NaN is used, or an empty string). Such objects are, 
unless otherwise mentioned, skipped during training. 

Data structures of the object class mapping store data transforma- 
tions (‘mappings’), classifiers, feature extraction results, data scaling 
definitions, nonlinear projections, etc. They are usually denoted by w. 
The easiest way of applying a mapping w to a data set z is by z*w. The 
matrix multiplication symbol * is overloaded for this purpose. This 
operation may also be written as map(z, w). Like everywhere else in 
MATLAB, longer series of operations are possible, e.g. z*w1*w2*w3, and
are executed from left to right.


Listing E.1


z = gendath([50 50]);        % generate data, 50 objects/class
[y,x] = gendat(z,[20 20]);   % split into training and test set
w1 = ldc(y);                 % linear classifier
w2 = qdc(y);                 % quadratic
w3 = parzenc(y);             % Parzen density-based
w4 = bpxnc(y,3);             % neural net with 3 hidden units
disp([testc(x*w1),testc(x*w2),testc(x*w3),testc(x*w4)]);
                             % compute and display classification errors
scatterd(z);                 % scatter plot of data
plotc({w1,w2,w3,w4});        % plot decision boundaries


A typical example is given in Listing E.1. This command file first
generates two sets of labelled objects using gendath, both containing
50 two-dimensional object vectors, and stores them, their labels and
prior probabilities in the dataset z. The distribution follows the so-called
‘Highleyman classes’. The next call to gendat takes this dataset and
splits it randomly into a dataset y, further on used for training, and a
dataset x, used for testing. This training set y contains 20 objects
from each class. The remaining 2 × 30 objects are collected in x.

In the next lines four classification functions (discriminants) are
computed, called w1, w2, w3 and w4. The first three are in fact density
estimators based on various assumptions (class priors stored in y are
taken into account). Programmatically, they are just mappings, as
xx = x*w1 computes the class densities for the objects stored in x.
xx has as many columns as there are classes in the training set for w1 (here
two). Because the test routine testc assigns objects (represented by the
rows in xx) to the class corresponding to the highest density (times prior
probability), the mappings w1, ..., w4 can be used as classifiers. The
linear classifier w1 (ldc) and quadratic classifier w2 (qdc) are both based
on the assumption of normally distributed classes. The first assumes equal
class covariance matrices. The Parzen classifier estimates the class
densities by the Parzen density estimation and has a built-in optimization
of the kernel width. The fourth classifier uses a feed forward neural
network with three hidden units. It is trained by the back propagation rule
using a varying step size.

The results are then displayed and plotted. The test data set x is used 
in a routine testc (test classifier) on each of the four discriminants. 
They are combined in a cell array, but individual calls are possible as 
well. The estimated probabilities of error are displayed in the MATLAB 
command window and may look like (note that they may differ due to 
different random seeds used in the generation of the data): 


0.1500 0.0333 0.1333 0.0833 


Finally the classes are plotted in a scatter diagram (scatterd)
together with the discriminants (plotc); see Figure E.1.


Figure E.1 Example output of Listing E.1 (scatter diagram of Feature 1
against Feature 2 with the decision boundaries of the classifiers; the
legend includes the entries Bayes-Normal-2 and Parzen Classifier)

For more advanced examples, see the Examples section in help
prtools.


E.4 SOME DETAILS 


The command help files and the examples should give sufficient infor- 
mation to use the toolbox with a few exceptions. These are discussed in 
the following sections. They deal with the ways classifiers and mappings 
are represented. As these are the constituting elements of a pattern 
recognition analysis, it is important that the user understands these 
issues. 


E.4.1 Data sets 





A dataset consists of a set of n objects, each given by D features. 
In PRTools such a data set is represented by an n by D matrix: n rows, 
each containing an object vector of D features. Usually a data set is 
labelled. An example of a definition is:


> z = dataset([1 2 3; 2 3 4; 3 4 5; 4 5 6],[3; 3; 5; 5])
4 by 3 dataset with 2 classes: [2 2]


The 4 x 3 data matrix (four objects given by three features) is accom- 
panied by a label list of four labels, connecting each of the objects to 
one of the two classes, labelled 3 and 5. Class labels can be numbers 
or strings and should always be given as rows in the label list. If 
the label list is not given all objects are given the default label 1. 
In addition it is possible to assign labels to the columns (features) of a
data set:


> z = dataset(rand(100,3),genlab([50 50],[3; 5]));
> z = setfeatlab(z,['r1';'r2';'r3'])
100 by 3 dataset with 2 classes: [50 50] 


The routine genlab generates 50 labels with value 3, followed by 50
labels with value 5. Using setfeatlab the labels ('r1', 'r2', 'r3') for
the three features are set. Various other fields can be set as well. One of
the ways to see these fields is by converting the data set to a structure,
using struct(z). Fields can be inspected individually by the . extension,
also defined for data sets:


> z.lablist
ans =
     3
     5

The possibility to set prior probabilities for each of the classes by 
setprior(x, prob, lablist) is important. The prior values in 
prob should sum to one. If prob is empty or if it is not supplied the 
prior probabilities are computed from the data set label frequencies. If 
prob equals zero then equal class probabilities are assumed. 

Various items stored in a data set can be retrieved by commands like 
getdata, getlablist and getnlab. The last one retrieves the 
numeric labels for the objects (1, 2,...) referring to the true labels stored 
in the rows of lablist. The size of the data set can be found by: 


[n,D] = size(x); or [n,D,K] = getsize(x);





in which n is the number of objects, D the number of features and K the 
number of classes (equal to max(nlab)). Data sets can be combined by 
[x; y] if x and y have equal numbers of features and by [x y] if they 
have equal numbers of objects. Creating subsets of data sets can be done
by z(I,J) in which I is a set of indices defining the desired objects and J
is a set of indices defining the desired features.

The original data matrix can be retrieved by double(z) or by +z. The
labels of the objects in z can be retrieved by labels = getlabels(z),
which is equivalent to


[nlab,lablist] = get(z,'nlab','lablist');
labels = lablist(nlab,:);


Be aware that the order of classes returned by getprob and getlablist
is the standard order used in PRTools and may differ from the one used in
the definition of z. For more information, type help datasets.


E.4.2 Classifiers and mappings 


There are many commands to train and use mappings between spaces of
different (or equal) dimensionalities. In general, the following applies:

if   z is an n by D data set    (n objects in a D-dimensional space)
and  w is a D by K mapping      (map from D to K dimensions)
then z*w is an n by K data set  (n objects in a K-dimensional space)


Mappings can be linear or affine (e.g. a rotation and a shift) as well as
nonlinear (e.g. a neural network). Typically they can be used as
classifiers. In that case a D by K mapping maps a D-feature data vector on
the output space of a K-class classifier (exception: two-class classifiers
like discriminant functions may be implemented by a mapping to a
one-dimensional space like the distance to the discriminant, K = 1).

Mappings are of the data type 'mapping' (class(w) is 'mapping') and
have a size of [D,K] if they map from D to K dimensions. Mappings can
be instructed to assign labels to the output columns, e.g. the class names.
These labels can be retrieved by














labels = getlabels(w);   % before the mapping, or
labels = getlabels(z*w); % after the data set z is mapped by w.


Mappings can be learned from examples, (labelled) objects stored in a 
data set z, for instance by training a classifier: 


w1 = ldc(z);        % the normal densities based linear classifier
w2 = knnc(z,3);     % the 3-nearest neighbour rule
w3 = svc(z,'p',2);  % the support vector classifier based on a 2nd
                    % order polynomial kernel


Untrained or empty mappings are supported. They may be very useful. 
In this case the data set is replaced by an empty set or entirely skipped: 


v1 = ldc; v2 = knnc([],3); v3 = svc([],'p',2);

Such mappings can be trained later by

w1 = z*v1; w2 = z*v2; w3 = z*v3;


(which is equivalent to the statements a few lines above) or by using cell
arrays


v = {ldc, knnc([],3), svc([],'p',2)}; w = z*v;

The mapping of a test set y by y*w1 is now equivalent to y*(z*v1).
Note that expressions are evaluated from left to right, so y*z*v1 will
result in an error as the multiplication of the two data sets (y*z) is
executed first.

Some trainable mappings do not depend on class labels and can be
interpreted as finding a feature space that approximates the original
data set as well as possible, given some conditions and measures.
Examples are the Karhunen-Loève mapping (klm), principal component
analysis (pca) and kernel mapping (kernelm) by which nonlinear,
kernel PCA mappings can be computed.

In addition to trainable mappings, there are fixed mappings, the 
parameters of which cannot be trained. A number of them can be set 
by cmapm; others are sigm and invsigm. 

The result x of mapping a test set on a trained classifier, x = y*w1, is 
again a data set, storing for each object in y the output values of the 
classifier. For discriminants they are sigmoids of distances, mapped on 
the [0, 1] interval, for neural networks their unnormalized outputs and 
for density based classifiers the densities. For all of them the following
holds: the larger the value, the more similar the object is to the
corresponding class. The values in a single row (object) do not necessarily
sum to one. Normalization to a sum of one can be achieved by the fixed
mapping classc:


x = y*w1*classc


The values in x can be interpreted as posterior probability estimates or 
classification confidences. Such a classification data set has column 
labels (feature labels) for the classes and row labels for the objects. 
The class labels of the maximum values in each object row can be 
retrieved by 


labels = x*labeld; or labels = labeld(x);

A global classification error follows from

e = x*testc; or e = testc(x);
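What these three operations compute can be illustrated on a small matrix of classifier outputs. The following is a base MATLAB sketch, not the PRTools implementation. The outputs, class names and true labels are made up. Row normalization stands in for classc, a row-wise maximum for labeld, and the plain fraction of wrong labels for testc (the actual testc may weight the classes differently, e.g. by their prior probabilities).

```matlab
out     = [0.2 0.6; 0.9 0.3; 0.4 0.1];    % 3 objects, 2 classes, unnormalized outputs
lablist = ['A'; 'B'];                      % class names belonging to the columns
truelab = ['B'; 'A'; 'B'];                 % true labels of the objects

post = out ./ repmat(sum(out, 2), 1, size(out, 2));  % each row now sums to one
[~, idx] = max(post, [], 2);               % column of the largest value in each row
labels = lablist(idx, :);                  % assigned class labels
e = mean(labels ~= truelab);               % fraction of misclassified objects
```

In this example the third object is assigned to class A although its true class is B, so one of the three objects is misclassified.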

Mappings can be inspected by entering w by itself, or using
display(w). This lists the size and type of a mapping or classifier as
well as the routine used for computing a mapping z*w. The data stored
in the mapping might be retrieved using +w. Note that the type of data
stored in each mapping depends on the type of mapping.




Affine mappings (e.g. constructed by klm) may be transposed. This is 
useful for back projection of data into the original space. For instance: 


w = klm(x,3);   % compute 3-dimensional KL transform
y = x*w;        % map x using w, resulting in y
z = y*w';       % back-projection to the original space
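The idea behind the transposed mapping can be reproduced in base MATLAB. The sketch below is a stand-in for klm under stated assumptions, not its implementation: it takes the leading eigenvectors of the sample covariance matrix as the mapping, subtracts the mean before projecting, and uses the transpose for back-projection. The data are synthetic and have only two degrees of freedom, so a two-dimensional mapping reconstructs them up to rounding errors.

```matlab
t = (1:20)';
u = mod(t, 5);
x = [t, u, t + u, 2*t - u];               % 20 objects, 4 features, 2 degrees of freedom
m  = mean(x, 1);                          % feature means
xc = x - repmat(m, size(x, 1), 1);        % centred data
C  = cov(xc);  C = (C + C')/2;            % covariance matrix, made exactly symmetric
[V, L] = eig(C);                          % eigenvectors and eigenvalues
[~, order] = sort(diag(L), 'descend');    % largest eigenvalues first
W  = V(:, order(1:2));                    % 4 by 2 mapping with orthonormal columns
y  = xc*W;                                % map to 2 dimensions
xr = y*W' + repmat(m, size(x, 1), 1);     % back-projection to the original space
err = max(abs(xr(:) - x(:)));             % largest reconstruction error
```

Because the columns of W are orthonormal, W' undoes the projection for every object that lies in the retained subspace. For data with more degrees of freedom than retained dimensions, the back-projection is only an approximation.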


A mapping may be given an output selection by w = w(:,J), in which J is a
set of indices pointing to the desired classes; y = z*w(:,J); is equivalent
to y = z*w; y = y(:,J);. Input selection is not possible for a mapping.
For more information, type help mappings.


E.5 HOW TO WRITE YOUR OWN MAPPING


Users can add new mappings or classifiers by a single routine that should 
support the following type of calls: 


• w = newmapm([], par1, ...);
  Defines the untrained, empty mapping.
• w = newmapm(z, par1, ...);
  Defines the map based on the training data set z.
• y = newmapm(z, w);
  Defines the mapping of data set z using w, resulting in a data set y.

For an example, list the routine subsc (using type subsc). This classifier
approximates each class by a linear subspace and assigns new objects to
the class of the closest subspace found in the training set. The dimensionalities
of the subspaces can be directly set by w = subsc(z,n), in which
the integer n determines the dimensionality of all class subspaces, or by
w = subsc(z,alf), in which alf is the desired fraction of variance to
retain, e.g. alf = 0.95. In both cases the class subspaces v (see the listing)
are determined by a principal component analysis of the single-class
data sets.
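The principle of such a subspace classifier can be sketched in base MATLAB. The code below is not the subsc routine and does not use PRTools. It is a minimal illustration on made-up two-dimensional data, assuming a fixed subspace dimensionality n for all classes. It also omits the normalization of the distances and their conversion to densities that the actual routine performs.

```matlab
% Made-up training data: class 1 lies near the line y = x, class 2 near y = 5 - x.
ztrain = [0 0; 1 1.1; 2 1.9; 3 3; 0 5; 1 4.1; 2 2.9; 3 2];
nlab   = [1; 1; 1; 1; 2; 2; 2; 2];
ytest  = [4 4.2; 4 1.2];                  % two test objects
n = 1;                                    % dimensionality of every class subspace
K = max(nlab);                            % number of classes
dist = zeros(size(ytest, 1), K);
for k = 1:K
    zk = ztrain(nlab == k, :);            % training objects of class k
    mk = mean(zk, 1);                     % class mean
    C  = cov(zk);  C = (C + C')/2;        % class covariance, made exactly symmetric
    [V, L] = eig(C);
    [~, order] = sort(diag(L), 'descend');
    Vk = V(:, order(1:n));                % principal directions spanning the subspace
    r  = ytest - repmat(mk, size(ytest, 1), 1);  % test objects relative to the class mean
    res = r - (r*Vk)*Vk';                 % component orthogonal to the subspace
    dist(:, k) = sqrt(sum(res.^2, 2));    % distance to the subspace of class k
end
[~, assigned] = min(dist, [], 2);         % the closest subspace determines the class
```

The first test object lies close to the line fitted to class 1 and the second close to the line fitted to class 2, so they are assigned to classes 1 and 2 respectively.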

The three possible types of calls, listed above, are handled in the three
main parts of the routine. If no input parameters are given
(nargin < 1) or no input data set is found (z is empty), an untrained
classifier is returned. This is useful for calls like w = subsc([],n), defining
an untrained classifier to be used in routines like cleval(z,w,...)
that operate on arbitrary untrained classifiers, but also to facilitate
training by constructions such as w = z*subsc or w = z*subsc([],n).

The training section of the routine is accessed if z is not empty and n is
either not supplied or set by the user as a double (i.e. the subspace
dimensionality or the fraction of the retained variance). PRTools
takes care that calls like w = z*subsc([],n) are executed as
w = subsc(z,n). The first parameter in the mapping definitions
w = mapping(mfilename, ... is substituted by MATLAB as 'subsc'
(mfilename is a MATLAB function that returns the name of the file
in which it is called). This string is stored by PRTools in the mapping_file
field of the mapping w and used to call subsc whenever it has to be applied to a
data set. For some special mappings, like ldc, another file might be used
(in the case of ldc it is normal_map).

The trained mapping w can be applied to a test data set y by x =
y*w or by x = map(y,w). Such a call is converted by PRTools to
x = subsc(y,w). Consequently, the second parameter of subsc(), n,
is now substituted by the mapping w. This is executed in the final part of
the routine. Here, the data stored in the data field of w during training is
retrieved (class mean, rotation matrix and mean square distances of
the training objects) and used to find normalized distances of the test
objects to the various subspaces. Finally they are converted to a density,
assuming a normal distribution of distances. These values are returned in
a data set using the setdata routine. This data set is thereby similar to
the input data set: same object labels, object identifiers, etc. Only the data
matrix itself is changed, and the columns now refer to classes instead of
features.

The definitions of the mappings and data sets form the core of
the PRTools toolbox. There are numerous supporting algorithms for
inspecting and visualizing classifiers and classification results. These, and
the possibilities for combining classifiers, are not discussed here. In the file
Contents.m (which can be inspected in MATLAB by help prtools)
the most important commands are listed. A final possibility is to inspect
all *.m files and use help.


Appendix F 


MATLAB Toolboxes Used 


Apart from the PRTools toolbox,¹ this book has used the following
standard MATLAB toolboxes:


• Control System Toolbox
  This toolbox contains many functions and data structures for the
  modelling and design of control systems. However, the linear time
  invariant cases are emphasized. As such, the toolbox contains a number
  of functions that are of interest for state estimation. Unfortunately,
  the scope of these functions is limited to the steady state case.

• Signal Processing Toolbox
  This toolbox is a collection of processing operations. Much attention
  is paid to the realization aspect of filters, i.e. how to build a
  filter with given properties, e.g. given cut-off frequencies, constrained
  phase characteristics and so on. This aspect is of less
  importance for the design of state estimators. Nevertheless, the
  toolbox contains a number of functions that are of interest with
  respect to the scope of this book:

  - correlation and covariance estimation
  - spectral analysis, e.g. the periodogram
  - estimation of parameters of random processes, e.g. the AR coefficients
    based on, for instance, the Yule-Walker equations
  - various tools, such as waveform generation and resampling
    functions.

• Optimization Toolbox
  This toolbox contains many functions for optimization problems
  including constrained or unconstrained, linear or nonlinear minimization.
  The toolbox also contains functions for curve fitting.

• Statistics Toolbox
  This toolbox includes functions for the following topics (among
  others):

  o many distribution and density functions of random variables and
    corresponding random number generators
  o parameter estimation for some standard distributions
  o linear and nonlinear regression methods
  o methods for principal components analysis
  o functions for the analysis of hidden Markov models.

• System Identification Toolbox
  This toolbox contains various methods for the identification of
  dynamic systems. Many models are available, but nevertheless the
  emphasis is on linear time invariant systems.


MATLAB offers many more interesting toolboxes, e.g. for the analysis of
financial time series (Financial Time Series Toolbox; Garch Toolbox),
and for calibration and curve fitting (Curve Fitting Toolbox, Model-based
Calibration Toolbox), but these toolboxes are not used in this
book.

¹ And with PRTools, implicitly the Neural Network Toolbox.


Index 


Acceptance boundary, 297
Algorithm
backward, 122
condensation, 131
forward, 116
forward-backward, 123
Viterbi, 125
ARIMA, 264
ARMA, 264
Autoregressive model, 264
first order, 91, 341
second order, 92, 137
Autoregressive, moving average models, 137


Back-propagation training, 175
Baseline removal, 259
Batch processing, 166
Bayes estimation, 142
Bayes' theorem, 7, 20, 48
Bayesian classification, 16, 21
Bhattacharyya upper bound, 192
Bias, 63, 142, 332
Binary measurements, 148
Branch-and-bound, 197


Chernoff bound, 192
Chi-square test, 346
Classifier
Bayes, 8, 33
Euclidean distance, 147
feed-forward neural network, 173, 317
least squared error, 166
linear, 29, 311
linear discriminant function, 162
Mahalanobis distance, 147
maximum a posteriori (MAP), 14, 35
minimum distance, 30
minimum error rate, 24, 33
nearest neighbour, 155, 312
Parzen density-based, 150, 312
perceptron, 164
quadratic, 27, 311
support vector, 168, 316
Clustering, 226
average-link, 229
characteristics, 216
complete-link, 229
hierarchical, 228
K-means, 228
quality, 227
single-link, 229
Completely
controllable, 269
observable, 272
Computational complexity, 178
Computational issues, 253
Condensing, 159, 160
Confusion matrix, 178
Consistency checks, 292, 296, 342
Control vector, 89
Controllability matrix, 269
Cost
absolute value, 50
function, 19, 33, 50
matrix, 19
quadratic, 50
uniform, 35, 50
Covariance, 63
Covariance model (CVM) based estimator, 331
Covariance models, 327
Cross-validation, 180, 312, 332
Curve
calibration, 350 
fitting, 324 


Decision boundaries, 27, 150 
Decision function, 17 
Dendrogram, 230 
Depth-first search, 198 
Design set, 140 
Detection, 35 
Differencing, 301 
Discrete
algebraic Riccati equation, 98
Lyapunov equation, 90, 274
Riccati equation, 271
Discriminability, 39, 41 
Discriminant function, 162 
generalized linear, 163 
linear, 162 
Dissimilarity, 226 
Distance 
Bhattacharyya, 191, 204 
Chernoff, 186, 191 
cosine, 226 
Euclidean, 226 
inter/intraclass, 186, 209 
interclass, 189 
intraclass, 189 
Mahalanobis, 205 
Matusita, 194 
probabilistic, 194 
Distribution 
Gamma, 49 
matrix, 89 
test, 297 
Drift, 261 
Dynamic stability, 270 


Editing, 159 

Entropy, 195 

Envelope detection, 323 
Ergodic Markov model, 114 
Error correction, 166 

Error covariance matrix, 98 
Error function, 39 

Error rate, 24, 33 


Estimation
maximum a posteriori (MAP), 50, 51
maximum likelihood, 57
minimum mean absolute error (MMAE), 50, 51
minimum mean squared error (MMSE), 50, 51
minimum variance, 55
Estimation loop, 277 
Evaluation, 263 
Evaluation set, 177 
Expectation-maximization, 235 
Experiment design, 258 


Feature, 14 
Feature extraction, 185 
Feature reduction, 216 
Feature selection, 185 
generalized sequential forward, 200
Plus l - take away r, 200
selection of good components, 330 
sequential forward, 200 
Feed-forward neural network, 173 
Fisher approach, 57 
Fisher’s linear discriminant, 213 
Fudge factor, 300 


Gain matrix, 89 

Generative topographic mapping, 246

Goodness of fit, 346 

Gradient ascent, 164 


Hidden data, 235 

Hidden Markov model (HMM), 113 
Hidden neurons, 174 

Hill climbing algorithm, 235 
Histogramming, 150 

Holdout method, 179 


i.i.d., 140
Identification of linear systems, 264 
Image compression, 219 
Importance sampling, 128, 131 
Incomplete data, 235 
Indicator variables, 236 
Information 

filter, 269, 283 

matrix, 67, 284 
Innovation(s), 62, 293 

matrix, 98 
Input vector, 89 


Kalman 
filtering, 254 
form, 62, 280 
linear-Gaussian MMSE form, 280 




Kalman filter, 97 
discrete, 97, 98 
extended, 105, 343 
iterated extended, 108 
linearized, 101, 341 
unscented, 112 

Kalman gain matrix, 62, 98 

Kernel, 151 
Gaussian, 171 
polynomial, 171 
radial basis function (RBF), 171
trick, 171
K-nearest neighbour rule, 157
Kohonen map, 241 


Lagrange multipliers, 170 
Latent variable, 246 
Learning, 139 
least squared error, 166 
nonparametric, 149-50 
parametric, 142 
perceptron, 164 
supervised, 139 
unsupervised, 139 
Learning data, 140 
Learning rate, 164 
Least squared error (LSE), 68 
Leave-one-out method, 180 
Left-right model, 114 
Level estimation, 339 
Likelihood 
function, 36, 57 
ratio, 36 
Linear 
dynamic equation, 89 
plant equation, 89 
state equation, 89 
system equation, 89 
Linear feature extraction, 202
Linear feedback, 102 
Linear-Gaussian system, 89 
Log-likelihood, 262 
Loss function, 19 


Mahalanobis distance, 28 

Mahalanobis distance classifier, 213

Margin, 169 

Markov condition, 83, 114 

Matched filtering, 326 


MATLAB functions 
HMM analysis, 127 
particle filtering, 112 
steady state Kalman filtering, 137, 268, 270, 273
Maximum likelihood, 326, 327 
Mean square error, 64 
Measure 
divergence, 194 
Matusita, 194 
Shannon’s entropy, 195 
Measurement 
matrix, 96 
model, 86 
noise, 96 
space, 14, 86 
vector, 14, 163 
Minimum error rate, 158 
Minimum mean squared error, 217
Minimum risk classification, 21
Missing data, 235 
Mixture 
of Gaussians, 228 
of probabilistic PCA, 240 
Model selection, 263 
Models, 83 
Monte Carlo simulation, 128 
Moving average models 
first order, 137 
Multi-dimensional scaling, 220
Multi-edit algorithm, 160 


Nearest neighbour rule, 157 
Neuron, 173 
Noise, 48, 75 
autocorrelated, 300 
cross-correlated, 303 
measurement, 96 
plant, 89 
process, 89 
quantization, 17 
quantum, 17 
system, 89 
thermal, 17, 96 
Nominal trajectory, 105 
Nonlinear mapping, 221 
Normalized
estimation error squared, 294
importance weights, 130
innovation squared, 295, 342


Observability, 253, 266
complete, 266
Gramian, 267
matrix, 267
stochastic, 269
Offset correction, 259 
Online estimation, 82 
Optimal filtering, 82 
Optimization criterion, 185 
Outlier clusters, 228 
Outliers, 72 
Overfitting, 76, 177, 184, 263 


Parameter vector, 48 
Partial autocorrelation function, 265
Particle filter, 112, 128, 344 
consistency criterion, 345 
implementation, 347 
Particles, 128-9 
Parzen estimation, 150 
Perceptron, 165 
Periodogram, 297 
Place coding, 169 
Posterior, 48 
prior, 48 
Potter’s square root filter, 287 
Predicted measurement, 98 
Prediction, 82, 94 
fixed interval, 95 
fixed lead, 95 
Principal 
component analysis, 216, 313, 329 
components, 218 
directions, 218 
Principle of orthogonality, 294 
Probabilistic dependence, 194 
Probability 
posterior, 20 
prior, 16-17 
Probability density 
conditional, 17, 36, 48, 83, 86 
posterior, 86 
unconditional, 17 
Proposal density, 129 


Quadratic decision function, 27 
Quantization errors, 96 


Random walk, 92 
Rauch-Tung-Striebel smoother, 304 
Regression, 74, 321, 351 

curve, 75 


Regularization, 146 
parameter, 146 
Reject rate, 33 
Rejection, 32 
class, 32 
Resampling by selection, 130 
Residual(s), 68, 75, 293 
Retrodiction, 82 
Riccati loop, 277
Risk, 21 
average, 21, 51 
conditional, 20, 23, 51 
Robust error norm, 72 
Robustness, 65 
ROC-curve, 40 
Root mean square (RMS), 337 


Sammon mapping, 224 
Sample 
covariance, 145 
mean, 143 
Scatter diagram, 15 
Scatter matrix, 188 
between-scatter matrix, 188 
within-scatter matrix, 188 
Self-organizing map, 241 
Sensor, 258 
location, 258 
Sensory system, 13 
Sequential update, 283 
Signal-to-noise ratio, 38, 190 
Single sample processing, 166 
Smoothing, 303 
Square root filtering, 287 
Stability, 65, 253, 270 
State space model, 81, 83 
State 
augmentation, 120 
estimation 
offline, 120 
online, 117 
mixed, 128 
variable, 81 
continuous, 81, 88 
discrete, 81, 113 
Statistical linearization, 112 
Steady state, 98, 270 
Steepest ascent, 164 
Stochastic observability, 269 
Stress measure, 221 
Subspace 
dimension, 240 
structure, 216 




Sum of squared differences (SSD), 68, 324
Support vector, 170
System identification, 253, 254, 256


Target vector, 166, 175 

Test set, 177 

Time-of-flight estimation, 319 
Topology, 241 

Training, 139 

Training set, 140 

Transfer function, 173

True class, 140

Unbiased, 64 
absolutely, 64 

Unit cost, 23 


Validation set, 177 
Variance, 142 


Winning neuron, 242 
Wishart distribution, 144 


Yule-Walker equations, 264 


